Another file parsing challenge - what tool to use?

Discussion in 'OT Technology' started by Astro, Feb 26, 2003.

  1. Astro

    Astro Code Monkey

    Joined:
    Mar 18, 2000
    Messages:
    2,047
    Likes Received:
    0
    Location:
    Cleveland Ohio
    Ok, I've got another challenge...

    I need to take a boatload of HTML files and strip out all text up to and including the opening body tag (plus the closing body tag and anything after it), while keeping all the content between the body tags intact (including all existing HTML tags).

    If done correctly, this should be a one-time pass.

    Some options:

    1. By hand. Not really an option, but it's always there.

    2. Regular expressions. I'm a big fan of them, but I'm thinking this might be a bit much for one regular expression. People are welcome to argue this point. Tools like Edit+ or PowerGrep would be likely candidates to execute the expression.

    3. DIY using simple parsing rules such as find body tag, pull in all text until close body tag and overwrite original file. Not tough. I'm thinking Perl might be best for this approach (don't need GUI). Java would be a close second. VB is not an option. C++ is probably overkill for this. I would kill myself if handed Assembler to do this (although it would go really fast!). Performance and looks are not high on the requirement list. Development time is.
     
  2. thewise1

    thewise1 Guest

    VBScript or Perl would both be good options.
     
  3. CompiledMonkey

    CompiledMonkey New Member

    Joined:
    Oct 26, 2001
    Messages:
    8,528
    Likes Received:
    0
    Location:
    Richmond, VA
    Option 3 using Perl or Java would probably work best. Dev time shouldn't be much in Java, I'm not sure about Perl.
     
  4. SLED

    SLED build an idiot proof device and someone else will

    Joined:
    Sep 20, 2001
    Messages:
    28,118
    Likes Received:
    0
    Location:
    AZ, like a bauce!
    Hmmm, I've written many a text parser in my day, and I would say this should be cake if there is only one condition, namely the "<body>" and "</body>" tags. I hear Perl is really good for that kind of stuff; people rave about it all the time (although I have no personal experience). Java would be my choice (assuming .NET is not an option). I've written an algebra recursive descent parser in Java before, and I must say the string functions are really nice, but no better than they are in .NET.
     
  5. Astro

    Microsoft solutions are pretty much out of the question.

    I'm leaning towards Perl and Java to get the job done.

    As for the body tag, about 3/4 of the tags appear to be bare, but the other 1/4 are loaded up with all sorts of parameters (attributes). This might not be a problem:

    I'm positive I can come up with a regular expression to pick off a body tag with or without parameters as well as its contents. Using the regular expression variables, I can dump the captured string (all the stuff between the body tags) out to the file (overwriting it).

    In theory:

    - read file in to a string
    - parse with a magic regexp
    - capture contents of body tag
    - dump this string back to the file, overwriting it.
    - rinse and repeat

    I have a feeling the Perl code won't be too much longer than my list above. I have limited experience with Java, but have a feeling it's going to introduce added complexity which I don't really need here.
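The parse and capture steps in the list above can be sketched in Perl roughly as follows. This is only a minimal sketch, not code from the thread: the helper name extract_body and the sample page are made up for illustration, and the pattern assumes that no attribute value inside the body tag contains a literal ">".

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the text between the
# first <body ...> tag and the next </body> tag, or undef if no pair is found.
sub extract_body {
    my ($html) = @_;
    # \b[^>]* accepts a bare <body> as well as one carrying attributes;
    # /i ignores case and /s lets "." match across newlines.
    if ($html =~ /<body\b[^>]*>(.*?)<\/body\s*>/is) {
        return $1;
    }
    return;
}

# Made-up sample page with an upper-case, attribute-laden body tag.
my $page = qq{<html><head><title>t</title></head>\n}
         . qq{<BODY bgcolor="#ffffff" text="black">\n}
         . qq{<p>Hello <b>world</b></p>\n</BODY></html>};

my $body = extract_body($page);
print defined $body ? $body : "no body tag found\n";
```

Returning undef for a page without body tags gives the caller a way to skip that file instead of overwriting it with nothing.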

    Now to make things more interesting, what about the use of an XML parser? The trick would be to fool the parser into thinking the HTML file was valid XML. But in theory, you could tell the parser to go find the body tag and return its contents (ignoring the body tag's parameters).
     
  6. Astro

    Perl did the trick (14 lines of code - read file, parse, test, write file)

    Without the file opening/closing/reading, here's the meat of the program:

    Code:
    $string =~ /(<body\b[^>]*>)(.*?)<\/body\s*>/is;
    Then variable $2 holds the data I'm looking for (all the HTML junk between the body tags).

    The more politically correct way to do this is to use the HTML::Parser library, but the body tag is a pretty simple tag.
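For readers who want the file handling that was left out, here is one way the read, parse, test, write sequence might look for a single file. It is a sketch under assumptions rather than the poster's actual 14 lines: the routine name strip_to_body, the script name in the usage comment, and the exact pattern are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical routine: rewrite one file so it holds only its body content.
# Returns 1 when the file was rewritten, 0 when no body tags were found.
sub strip_to_body {
    my ($path) = @_;

    # Read the whole file into one string ("slurp" mode).
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    my $html = do { local $/; <$in> };
    close($in);

    # Test before writing, so a file without body tags is left untouched.
    return 0 unless defined($html)
        && $html =~ /<body\b[^>]*>(.*?)<\/body\s*>/is;
    my $body = $1;

    # Overwrite the original file with just the captured content.
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $body;
    close($out) or die "Cannot close $path: $!";
    return 1;
}

# Command-line use (script name is made up): perl strip_body.pl page.html
if (@ARGV) {
    my $file = $ARGV[0];
    print strip_to_body($file) ? "rewrote $file\n" : "skipped $file\n";
}
```

Because the original is overwritten in place, it is worth running this against a copy of the files first. A second run over an already stripped file finds no body tags and leaves it alone.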
     
  7. SLED

    That's sweet. I want to start looking at Perl now.
     
  8. Astro

    I'm positive this could be done in Java. But this seems to work very well (maybe not as glamorous as Java).

    Tomorrow I think I get to figure out how to do Perl directory filtering. I have the script set up to take 1 argument (the file name). This works great. But I think I'm going to set it up to accept *.* wildcarding (this script is going to have to chew on about 1200 files).
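One possible way to handle the wildcard case is Perl's built-in glob(), which expands patterns inside the script. That matters mostly where the shell passes a pattern through unexpanded, as the Windows command prompt generally does. The helper name expand_args and the script name below are assumptions for illustration, and note that glob() splits its argument on whitespace, so file names containing spaces would need different handling.

```perl
use strict;
use warnings;

# Hypothetical helper: turn command-line arguments into a list of plain files.
# Patterns such as *.html are expanded by glob(); arguments the shell has
# already expanded pass through unchanged.
sub expand_args {
    my @args = @_;
    my @files;
    for my $arg (@args) {
        # glob() hands back a name without wildcards as-is, so the -f test
        # is what drops directories and names that do not exist.
        push @files, grep { -f $_ } glob($arg);
    }
    return @files;
}

# Example (script name is made up): perl strip_all.pl "*.html" "*.htm"
for my $file (expand_args(@ARGV)) {
    print "would process $file\n";   # call the single-file routine here
}
```

On a Unix-style shell the pattern *.* only matches names that contain a dot, so a narrower pattern such as *.html is probably the safer choice for a directory that also holds other files.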
     
  9. CompiledMonkey

    Isn't it just more fun to write "throw away code"? :bigthumb: I like doing quick and dirty spikes more than the actual project sometimes. :o
     
