Perl help...

Discussion in 'OT Technology' started by crazybenf, Apr 20, 2005.

  1. crazybenf

    crazybenf Active Member

    Joined:
    Nov 14, 2001
    Messages:
    15,575
    Likes Received:
    2
    I have shit for skills in Perl. This is what I'm trying to accomplish:



    I have 18,203 text files that contain a LOT of garbage data, and mixed in is a string that I want to extract.

    The string looks like this:



    <html shit><html shit><sdsdlf 101010101010-0101.XXX">dlfgjdsglkjsd<sdflkgdsfg
    <html shit><html shit><sdsdlf 10345784501010-0101.XXX">dlfgjdsglkjsd<sdflkgdsfg
    <html shit><html shit><sdsdlf 1023234324010-0101.XXX">dlfgjdsglkjsd<sdflkgdsfg
    <html shit><html shit><sdsdlf 101017676710-0101.XXX">dlfgjdsglkjsd<sdflkgdsfg
    <html shit><html shit><sdsdlf 1012919210-0101.XXX">dlfgjdsglkjsd<sdflkgdsfg



    Each file has 1-100 instances of the text I want to extract. The number sequence before the .XXX is random. I want all 18,000 files scanned, each 0000000000000-00000.XXX extracted and followed by the name of the file it came from, all in ONE big text file.


    ie:



    1012919210-0101.XXX blahblahblah.txt
    1032453210-0101.XXX blahblahblah2.txt
    1045439210-0101.XXX blahblahblah3.txt
    1012345410-0101.XXX blahblahblah3.txt
    1012232392-0101.XXX blahblahblah3.txt
    1212311210-0101.XXX blahblahblah3.txt
    1041254920-0101.XXX blahblahblah3.txt





    I've been toying with sed/awk on this for two days now, and I don't think it can be done with those tools.

    $5.00 paypal to the ninja that can get this done for me. :)
     
  2. samm

    samm Next in Line

    Joined:
    Dec 22, 2000
    Messages:
    2,630
    Likes Received:
    0
    Location:
    San Jose, CA
    You want to use a regular expression:

    [0-9]+-[0-9]+\.XXX

    (A bare "." in a regex matches any single character, so the literal dot before XXX needs a backslash.) That should do what you want. Writing a Perl script to do this and output everything to a master file shouldn't be too difficult.
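    A quick smoke test of a digits-hyphen-digits pattern against one of the sample lines from the first post:

```shell
# grep -o prints only the matched text; -E enables extended regexes.
printf '%s\n' '<html shit><html shit><sdsdlf 101010101010-0101.XXX">dlfgjdsglkjsd<sdflkgdsfg' \
  | grep -oE '[0-9]+-[0-9]+\.XXX'
# prints: 101010101010-0101.XXX
```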
     
  3. Joe_Cool

    Joe_Cool Never trust a woman or a government. Moderator

    Joined:
    Jun 30, 2003
    Messages:
    299,249
    Likes Received:
    538
    You can do anything with sed and awk. ;)

    So to be clear, in your example above, you want to end up with a huge text file containing this?

    blahblahblah.txt
    blahblahblah2.txt
    blahblahblah3.txt
    blahblahblah3.txt
    blahblahblah3.txt
    blahblahblah3.txt
    blahblahblah3.txt


    Edit: D'OH. Nevermind. Reading comprehension > me.

    I'm about to go to bed now, but if somebody hasn't done this by midmorning, I'll give it a shot.
     
  4. crontab

    crontab (uid = 0)

    Joined:
    Nov 14, 2000
    Messages:
    23,441
    Likes Received:
    12
    Your example is too vague to write any real Perl against. Posting the exact "html shit" will help. Since the part you want to rip out is (I'm assuming) a random run of digits, just searching for that number is next to impossible.

    Is "-0101.XXX" unique? Anything else within that line unique?

    Then why did you post:

    1012919210-0101.XXX blahblahblah.txt
    1032453210-0101.XXX blahblahblah2.txt
    1045439210-0101.XXX blahblahblah3.txt
    1012345410-0101.XXX blahblahblah3.txt
    1012232392-0101.XXX blahblahblah3.txt
    1212311210-0101.XXX blahblahblah3.txt
    1041254920-0101.XXX blahblahblah3.txt

    Those are multiple blah text files?

    Well, I'm assuming -0101.XXX is unique, I'm assuming all these files are *.xml, and I'm assuming you want the output in ONE big file. Here's a crude awk way:

    grep "\-0101\.XXX" *.xml | awk '{print $4}' | awk -F\" '{print $1}' > ONE.big.file

    or, depending on the OS, if 18K files is too many arguments for the command line:

    ls | grep xml | xargs -i grep "\-0101\.XXX" {} | awk '{print $4}' | awk -F\" '{print $1}' > ONE.big.file
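    A variant of the same grep idea that also keeps the filename next to each match, in the token-then-filename order the first post asked for (assumes GNU grep's -o and -H options, *.xml inputs, and the literal -0101.XXX marker):

```shell
# -H forces a "filename:" prefix, -o prints only the matched token.
# awk then flips each "file:token" line into "token file".
grep -HoE '[0-9]+-0101\.XXX' *.xml | awk -F: '{print $2, $1}' > ONE.big.file
```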
     
  5. crontab

    crontab (uid = 0)

    Did this help? Or did you find another solution?
     
