Best way to capture text for a summary from an HTML page?

Discussion in 'OT Technology' started by babygodzilla, Sep 12, 2008.

  1. babygodzilla

    babygodzilla I love rice

    Joined: Nov 5, 2001
    Messages: 3,108
    Likes Received: 0

    hey guys,

    i need to write some javascript that will capture reasonably good text from a variety of HTML pages. i'm not sure if this can be done or what the best way to do it is. the text could be contained in any of a myriad of tags: a <div>, a <p>, a <span>, etc. what's the best way for the script to identify a block of text, strip the markup, and grab that text?


    thanks!
     
  2. Peyomp

    Peyomp New Member

    Joined: Jan 11, 2002
    Messages: 14,017
    Likes Received: 0

    What are you trying to do?
     
  3. babygodzilla

    like a crawler that will crawl the HTML files in a certain folder and get as much info about each page as it can by extracting the text from it, without the markup of course.
     
  4. Peyomp

    So you want to get a large pile of words, minus the tags?

    That shouldn't be too hard using the DOM in JS. I suggest you set up something to walk the DOM and dump out the text values, then work on filtering out the stuff you don't want until you arrive at a good output.
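
A rough sketch of that kind of DOM walk might look like the following. It is only an illustration, not code posted in the thread: the names `collectText`, `extractText`, and `SKIP_TAGS` are made up for this example, and the skip list is a starting guess at "stuff you don't want". It relies only on the standard `nodeType`, `nodeValue`, `tagName`, and `childNodes` properties of DOM nodes.

```javascript
// Node-type constants as defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical skip list: elements whose text is rarely useful in a summary.
const SKIP_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT", "TEMPLATE"]);

// Recursively collect the text nodes under `node`, in document order.
function collectText(node, out = []) {
  if (node.nodeType === TEXT_NODE) {
    // Collapse runs of whitespace and drop whitespace-only nodes.
    const text = node.nodeValue.replace(/\s+/g, " ").trim();
    if (text) out.push(text);
  } else if (
    node.nodeType === ELEMENT_NODE &&
    SKIP_TAGS.has(node.tagName.toUpperCase())
  ) {
    // Filtered element: ignore it and everything inside it.
    return out;
  } else if (node.childNodes) {
    // Any other node (element, document, ...): descend into its children.
    for (const child of node.childNodes) collectText(child, out);
  }
  return out;
}

// Join the collected pieces into one markup-free string.
function extractText(root) {
  return collectText(root).join(" ");
}
```

In a browser this could be tried with `extractText(document.body)`. From there, the filtering step is a matter of adding entries to `SKIP_TAGS` (or other checks) until the output looks reasonable for the pages in question.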
     
  5. White Stormy

    White Stormy Take that, subspace!

    Joined: Sep 17, 2002
    Messages: 85,489
    Likes Received: 70
    Location: Sparkopolis
