Perl Help

Discussion in 'OT Technology' started by unrealii, Aug 27, 2004.

  1. unrealii

    unrealii professor of plant biology

    Joined:
    May 6, 2001
    Messages:
    2,037
    Likes Received:
    0
    Location:
    So CALI
    I'm trying to write a crawler that extracts all the URLs from a web page. I was able to find some sample code on the internet and modify it to work with my input and output files; however, the code extracts every URL on the page, including images and mailto: links. Is there any way I can filter those out?

    My code is very similar to:
    http://www.cs.utk.edu/cs594ipm/perl/crawltut.html

    I have been looking through the modules I am using for something that will extract only what I want, but I haven't found anything. The closest I found was this: http://iis1.cps.unizar.es/Oreilly/perl/cookbook/ch20_04.htm
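
    That cookbook recipe is built around HTML::LinkExtor, which hands your callback every link-bearing tag on the page. A minimal sketch of the filtering idea, assuming LWP-style usage (the sample HTML and base URL here are made up for illustration; in the real crawler they'd come from the page you fetched):

    ```perl
    #!/usr/bin/perl
    # Sketch: keep only <a href="..."> links, dropping <img> tags and mailto: URLs.
    # Assumes HTML::LinkExtor and URI are installed (both ship with libwww-perl).
    use strict;
    use warnings;
    use HTML::LinkExtor;
    use URI;

    # In the real crawler, $base and $html would come from the LWP fetch;
    # hard-coded here for illustration.
    my $base = 'http://example.com/';
    my $html = <<'HTML';
    <a href="page.html">a page</a>
    <img src="logo.gif">
    <a href="mailto:someone@example.com">mail me</a>
    HTML

    my @urls;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        # Only follow anchor tags; this drops <img src>, <link href>, etc.
        return unless $tag eq 'a' and defined $attr{href};
        my $url = URI->new_abs($attr{href}, $base);
        # Drop non-HTTP schemes such as mailto:
        return if $url->scheme eq 'mailto';
        push @urls, $url->as_string;
    });
    $parser->parse($html);
    $parser->eof;

    print "$_\n" for @urls;
    ```

    The key point is that the callback sees the tag name as its first argument, so you can whitelist `a` tags instead of taking everything HTML::LinkExtor reports.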
     
  2. unrealii

    Never mind, that code from the O'Reilly book helped. Images are gone unless they are linked in an href. I still have mailto: links to take care of; I can probably tell it to ignore those in the downloader file.
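
    For the downloader side, one simple approach (a sketch, not your actual file, since that code isn't shown here) is to filter the URL list by scheme before fetching, using the URI module:

    ```perl
    #!/usr/bin/perl
    # Sketch: drop mailto: URLs from a list before handing it to the downloader.
    # The @urls list here is hypothetical sample data.
    use strict;
    use warnings;
    use URI;

    my @urls = (
        'http://example.com/a.html',
        'mailto:someone@example.com',
        'http://example.com/b.html',
    );

    # Keep only URLs whose scheme is http (also drops ftp:, javascript:, etc.).
    my @keep = grep { (URI->new($_)->scheme // '') eq 'http' } @urls;

    print "$_\n" for @keep;
    ```

    Filtering by an allowed scheme (`http`) rather than by a banned one (`mailto`) is usually safer for a crawler, since it also skips things like `javascript:` and `ftp:` links you probably don't want to download.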
     
