Web scraping program?

Discussion in 'OT Technology' started by airbball23, Mar 16, 2009.

  1. airbball23

    airbball23 Rent this Space only $5/mnth

    Joined:
    Jan 13, 2007
    Messages:
    1,489
    Likes Received:
    0
    Is there any program I can download to scrape info from sites that don't offer an RSS feed, and possibly scrape them often? For example, if a breaking news story comes up on CNN, could I scrape it ASAP, or scrape every 5 mins?
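    One way to handle the "scrape every 5 mins" part is to poll the page and only act when its content changes. A minimal PHP sketch, assuming a hash of the page body is a good enough change signal; pageChanged() and pollPage() are hypothetical helper names, not part of any library:

```php
<?php
// Hypothetical helper: report whether the page body differs from the last
// one seen, by comparing MD5 hashes. $lastHash is updated in place.
function pageChanged(string $html, ?string &$lastHash): bool
{
    $hash = md5($html);
    $changed = ($hash !== $lastHash);
    $lastHash = $hash;
    return $changed;
}

// Hypothetical polling loop. $fetch is any callable that returns the page
// body as a string, or false on failure. Returns how many polls saw a change
// (the very first successful poll counts as a change).
function pollPage(callable $fetch, int $intervalSeconds, int $maxPolls): int
{
    $lastHash = null;
    $changes = 0;
    for ($i = 0; $i < $maxPolls; $i++) {
        $html = $fetch();
        if ($html !== false && pageChanged($html, $lastHash)) {
            $changes++;
            // New content: this is where the page would be parsed.
        }
        if ($i < $maxPolls - 1) {
            sleep($intervalSeconds);
        }
    }
    return $changes;
}
```

    Calling pollPage() with a fetch callback such as function () { return file_get_contents('http://www.site.com/test/'); } and an interval of 300 would check every 5 minutes. Pages with rotating ads or timestamps change on every request, so hashing only the extracted fragment is probably more reliable than hashing the whole page.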
     
  2. intrktevo

    intrktevo New Member

    Joined:
    Oct 18, 2004
    Messages:
    5,781
    Likes Received:
    0
    Location:
    UCF
    It's possible, but a single script that can scrape any random website doesn't exist AFAIK. You'll have to get someone to write a custom script for whatever site you want to get data from.
     
  3. Supergeek

    Supergeek New Member

    Joined:
    Jan 23, 2007
    Messages:
    1,855
    Likes Received:
    0
    Location:
    Colorado
    Google PHP, Perl, and cURL.
     
  4. ROFL

    ROFL New Member

    Joined:
    Dec 27, 2004
    Messages:
    6,862
    Likes Received:
    0
    Location:
    England
    http://forums.offtopic.com/showpost.php?p=112721317&postcount=21 has code that scrapes a page's HTML. It was posted some time ago, so I'm not sure it still works, but it may be useful for grabbing content between div tags.

    Code:
    <?php
    // Return the text after the first $StartTag, up to the next $EndTag.
    // Returns an empty string if $StartTag is not found.
    function GrabInBetween($StartTag, $EndTag, $File){
        $Contents = explode($StartTag, $File, 2);
        if (count($Contents) < 2){
            return '';
        }
        $Contents = explode($EndTag, $Contents[1], 2);
        return $Contents[0];
    }

    // Download the page into a string instead of printing it directly.
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, 'http://www.site.com/test/');
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($curl_handle);
    curl_close($curl_handle);

    if (empty($result)){
        print "Can't connect with curl";
    }else{
        $OpenTag = '<div class="gbox-dotd-container">';
        $CloseTag = '</div>';
        $done = GrabInBetween($OpenTag, $CloseTag, $result);
        print $done;
    }
    ?>
    
     
  5. Mikenotmike

    Mikenotmike

    Joined:
    Jun 1, 2001
    Messages:
    6,244
    Likes Received:
    0
    Location:
    USA
    http://www.autoitscript.com/

    You can scrape anything with it. It's kind of weird to learn at first, but really not that difficult if you have any programming background.

    It has to be run locally, though.
     
  6. Ricky

    Ricky █▄ █▄█ █▄ ▀█▄

    Joined:
    Jun 17, 2005
    Messages:
    38,767
    Likes Received:
    6
    i have a thread on it somewhere here
     
  7. drpepper

    drpepper Active Member

    Joined:
    Nov 13, 2006
    Messages:
    38,076
    Likes Received:
    2
    Location:
    San Antonio
    it works, but how would you style the results?
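    One possible answer to the styling question: reduce the scraped fragment to plain text and wrap it in markup of your own, so your own stylesheet applies. A small sketch; wrapScraped() and the class names are made up for illustration:

```php
<?php
// Hypothetical helper: strip the source site's tags from a scraped fragment
// and wrap the remaining text in a <div> carrying our own CSS class.
function wrapScraped(string $fragment, string $cssClass): string
{
    // Drop the original markup, then decode entities so that the escaping
    // step below does not double-encode them.
    $text = html_entity_decode(strip_tags($fragment), ENT_QUOTES, 'UTF-8');
    return sprintf(
        '<div class="%s">%s</div>',
        htmlspecialchars($cssClass, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars(trim($text), ENT_QUOTES, 'UTF-8')
    );
}

// Sample fragment standing in for the output of GrabInBetween().
echo wrapScraped('<b>Deal of the day</b> &amp; more', 'scraped-box');
```

    This throws away the original formatting entirely. Keeping some of the source tags is also possible, but then the source site's class names would need matching rules in your own CSS.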
     
  8. ge0

    ge0 New Member

    Joined:
    Oct 31, 2005
    Messages:
    8,398
    Likes Received:
    0
    Location:
    JERSEY
    You know, in PHP you can load a page such as http://www.lol.com as a file, explode() it, and then run preg_match() on the result.
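    A sketch of the preg_match idea, using preg_match_all() to collect every occurrence rather than only the first. grabAllBetween() is a hypothetical helper, and the sample string stands in for a downloaded page:

```php
<?php
// Hypothetical helper: return the text between $open and the next $close,
// for every occurrence in $html.
function grabAllBetween(string $open, string $close, string $html): array
{
    // preg_quote() escapes the tags so they are matched literally;
    // the "s" modifier lets "." match newlines inside the captured text.
    $pattern = '/' . preg_quote($open, '/') . '(.*?)' . preg_quote($close, '/') . '/s';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Made-up sample markup, an assumption for illustration only.
$sample = '<div class="item">first</div><p>skip</p><div class="item">second</div>';
print_r(grabAllBetween('<div class="item">', '</div>', $sample));
```

    For a live page, the input could come from file_get_contents('http://www.lol.com'), which needs allow_url_fopen enabled, or from the cURL code posted earlier in the thread.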
     
  9. airbball23

    airbball23 Rent this Space only $5/mnth

    ohsnaps! thanks
     
  10. airbball23

    airbball23 Rent this Space only $5/mnth

    This is a little harder than I thought.

    alright here's another example: http://www.hot97.com/BroadcastHistory.aspx

    Let's say I want to scrape the last songs played. The songs are in <tr> rows, and each row has a class such as "trdef".

    I tried to figure out how to do it in AutoIt, but I'm clueless.

    I tried scraping the data with Google Spreadsheets, because it's pretty easy to scrape things like Wikipedia articles that have tables (e.g. http://ouseful.wordpress.com/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/). But since this page has no tables, just a tr class, I was having difficulty scraping the data.

    How would I set up the code listed above for this page?
    For the open tag, do I put in the tr class instead of <div class="gbox-dotd-container">?
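    Since <tr> is a table-row element, one option is to parse the page with PHP's DOM extension and select rows by class instead of splitting on strings. A rough sketch, assuming the rows look something like the made-up sample below (the real page's columns are not shown in this thread); rowsByClass() is a hypothetical helper:

```php
<?php
// Hypothetical helper: return the cell text of every <tr> whose class
// attribute contains $class, as one array of cells per row.
function rowsByClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word; assumes $class contains no quotes.
    $query = sprintf(
        '//tr[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $rows = [];
    foreach ($xpath->query($query) as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Made-up sample markup; only the class name "trdef" comes from the thread.
$sample = '<table>'
        . '<tr class="trhead"><td>Time</td><td>Song</td></tr>'
        . '<tr class="trdef"><td>10:01</td><td>Song A</td></tr>'
        . '<tr class="trdef"><td>10:05</td><td>Song B</td></tr>'
        . '</table>';
print_r(rowsByClass($sample, 'trdef'));
```

    The page HTML would come from curl_exec() as in the script above, and each inner array then holds the cell text of one row.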
     
  11. hank85

    hank85 sudo shred /dev/sda -f -v -z --iterations=6

    Joined:
    Jul 23, 2008
    Messages:
    4,360
    Likes Received:
    0
    Yes, you would need to modify both the open and close tags for that particular site and any other site you wish to use this script with.

    I'll modify the above script tomorrow to work with this site and any other site you'd like to scrape that has multiple bits of information in varying containers. I'll post a reply in here along with a new thread on the board.

    Been seeing a lot of interest in a php scraper lately.


    Edit: Not gonna post this anymore. Someone put a purchase bid on the script. Sorry guys.
     
    Last edited: Mar 23, 2009
  12. Mikenotmike

    Mikenotmike

    I'm interested in seeing this thread develop. I did a lot of this with the AutoIt I mentioned earlier. There's a whole library of functions someone made (called ie.au3) that pretty much lets you do ANYTHING with respect to scraping. There must be a PHP version of this somewhere, I would think?

    Examples of the AutoIt functions:
    http://www.autoitscript.com/forum/index.php?showtopic=25629
     
  13. airbball23

    airbball23 Rent this Space only $5/mnth

    And a really newb question: how do I run the script? Yeah, it's in PHP, but do I put it on my webhost, and then how do I execute the file? Can I run it locally to see if it works, and then put it on a web server and feed the results into a MySQL database using PHP?

    i guess i need to start learning PHP 101 lol
     
  14. airbball23

    airbball23 Rent this Space only $5/mnth

    Sounds cool, thanks mang! I really want to see how this is done.

    www.popurls.com probably uses RSS feeds, but idk. I like how he is neat and organized with his shit. This is one code:

    I don't see any RSS in there, but I do see he uses a lot of JavaScript. Idk, what do you guys think? He also keeps an archive of past top stories. Would these be in his MySQL database or something, where he scrapes them and then archives them for later use?
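    How popurls actually stores its archive isn't known from this thread, but the scrape-then-archive idea can be sketched with PDO. This assumes a hypothetical stories table and uses in-memory SQLite (the pdo_sqlite extension) so it runs without a database server; a MySQL setup would mainly differ in the DSN and credentials:

```php
<?php
// Hypothetical helper: archive one scraped story, skipping URLs that are
// already stored. Returns true if a new row was inserted.
function archiveStory(PDO $db, string $url, string $title): bool
{
    $check = $db->prepare('SELECT COUNT(*) FROM stories WHERE url = ?');
    $check->execute([$url]);
    if ((int) $check->fetchColumn() > 0) {
        return false; // already archived
    }
    $insert = $db->prepare(
        'INSERT INTO stories (url, title, scraped_at) VALUES (?, ?, ?)'
    );
    $insert->execute([$url, $title, date('Y-m-d H:i:s')]);
    return true;
}

// In-memory SQLite keeps the sketch self-contained. For MySQL the DSN would
// look like 'mysql:host=localhost;dbname=scraper' plus a user and password
// (all hypothetical values).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec(
    'CREATE TABLE stories ('
    . 'url VARCHAR(255) PRIMARY KEY, title VARCHAR(255), scraped_at VARCHAR(19))'
);

var_dump(archiveStory($db, 'http://example.com/a', 'Story A')); // first time: stored
var_dump(archiveStory($db, 'http://example.com/a', 'Story A')); // repeat: skipped
```

    Run on a schedule, a scraper could call archiveStory() for each headline it finds, and an archive page would then just be a SELECT over the stories table.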
     
  15. FartLighter

    FartLighter Resident Fart Expert OT Supporter

    Joined:
    Jul 5, 2005
    Messages:
    2,853
    Likes Received:
    9
    Location:
    Mammoth Lakes, CA
    Anybody use Python for their web scraping? It is all I use (used to use Perl). I thought it was the most common, as well as cURL. :hs:
     
  16. intrktevo

    intrktevo New Member


    If your file is file.php, upload it to your host and go to http://yourdomain.com/file.php

    You can easily set up a local testing environment with http://www.apachefriends.org/en/xampp.html
     
