WEB creating a feed from a feed?

Discussion in 'OT Technology' started by burn__, Jun 12, 2008.

  1. burn__

    i have no idea where to start on this. the website at our dealership displays all of its information through a feed provided by a 3rd party.

    http://www.cdmdata.com/cdmdigitallot/vehicles.asp?dealerid=9751

    i don't have access to their database, so is there a way i can basically data-mine the feed they've created for us and pull out random thumbnails/models to display on a timer?

    basically what i'm trying to do is have a 4-car vertical list on another one of our websites that displays a random set of 4 cars from our inventory (link posted above) each time the page loads. it wouldn't be too bad to hardcode it, but knowing them they're going to want me to change the cars 3-4 times a day and i want to avoid that if possible. :hs:
     
  2. Josh

    what is the feed url?
     
  3. burn__

    all i have is http://www.cdmdata.com/cdmdigitallot/vehicles.asp?dealerid=9751 which is already laid out into their template.

    the only way i can think to do it is to build a bot to data-mine that site every morning to get updated inventory, but that's a whole new set of problems. i was just hoping there was a way to extract the basic information without having the direct feed that CDM uses to fill their template.
     
  4. Josh

    It would be much simpler to actually have the feed URL. If you're building the site for them, why wouldn't they give it to you?
     
  5. burn__

    the company we use to manage our inventory is 3rd party, and they threw a shit fit last time someone tried to get their feed because it's "unauthorized access to their DB". autotrader does the same thing with their feeds; they don't let the dealers have "direct access" because they want to charge us to have them build a template and all that. i was trying to see if there was a way around it, but haven't come across anything simple yet. :mamoru:
     
  6. Josh

    Data mining would work if you're good at that sort of thing.
     
  7. burn__

    i've never built anything like that from scratch though, so that'd be a big jump for me for such a small project :hs:
     
  8. Limp_Brisket

    you could data-mine that page; here are some problems i've noticed though. it's multi-paged, so if you just data-mined that main link you would always get the cars on the first page, which seems to be sorted by year (or title, which starts with the year). this means you'd have to randomly generate a valid page number, and then you could grab as many cars as you wanted from that page (if you wanted it to select more cars than just those on the first page).

    the problem with that is the page number is not read through the query string; instead they have some obfuscated javascript (bastards) to load the page through a POST request. instead of trying to decipher that code, i loaded up wireshark and you can see all the data they send:

    DealerID=9751&Wlot=False&location=&age=0&did=9751&mmBodies_Hidden=&mmMakes_Hidden=0&mmModels_Hidden=&srch=s&mmType=used&mmMakes=0&category=0&transmission=&extcolor=&bprice=0&eprice=0&bmileage=0&emileage=0&byear=0&eyear=0&stockid=&srchvin=&SortBy=YEAR&SortOrder=ASCENDING&PageNo=4

    so you might be able to generate your own POST request to grab the data; however, they might have some kind of verifier to make sure the request is theirs, such as a referer check, cookies, a server session, or a hash (which i don't see in there).

    so in other words, it's possible, but it would take some work. data-mining the page with regexps would be slower than just reading a feed, though probably not too slow. however, if they find out what you're doing you might be breaking some ToS or something and they could throw a fit.
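
    if you wanted to try the POST route, a hand-built request might look something like this. it's just an untested sketch reusing the form data from the wireshark capture above - if they check cookies, a server session, or the referer like i mentioned, this alone won't be enough.

    Code:
    <?php
    // untested sketch: replay the captured POST body by hand
    $page = 4; // or any valid page number
    $postData = 'DealerID=9751&Wlot=False&location=&age=0&did=9751&mmBodies_Hidden=&mmMakes_Hidden=0'
              . '&mmModels_Hidden=&srch=s&mmType=used&mmMakes=0&category=0&transmission=&extcolor='
              . '&bprice=0&eprice=0&bmileage=0&emileage=0&byear=0&eyear=0&stockid=&srchvin='
              . '&SortBy=YEAR&SortOrder=ASCENDING&PageNo=' . $page;

    $context = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                   . "Referer: http://www.cdmdata.com/cdmdigitallot/vehicles.asp?dealerid=9751\r\n",
        'content' => $postData,
    )));

    // $html will hold the returned page (or false if they reject the request)
    $html = file_get_contents('http://www.cdmdata.com/cdmdigitallot/vehicles.asp?dealerid=9751', false, $context);
    ?>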
     
  9. burn__

    :bowdown:
    i'll have to look into generating my own POST request. didn't even know i could do that :o i'm still going to try to see if i can get their direct feed first, but this seems like a viable second option if they don't want to let it go. thank you!
     
  10. Josh

    I tried to scrape it, but it won't let me open the page in PHP. I've never mined an ASP page, so maybe I'm missing something...

    http://weborl.com/carScrape/
     
  11. Limp_Brisket

    hah, you know what, if you put Pageno=# in the query string it'll still work, so it doesn't have to be sent through POST; that's just the way they do it.
     
  12. Josh

    i noticed that, it makes for an easy loop-and-mine, but any idea why the page generates that error? ^^^^ the output there is from their site
     
  13. Limp_Brisket

    well, here is the GET request that firefox made that works:
    Code:
    GET /cdmdigitallot/vehicles.asp?dealerid=9751 HTTP/1.1
    Host: www.cdmdata.com
    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Language: en-us,en;q=0.5
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive
    Cookie: CDMVehicleBrowser=e1be60e9-a7bb-48d8-8b72-d0b5c91c645d; ASP.NET_SessionId=00hpa4454feeek55zlqt2wrs; BIGipServerpool.www.cdmdata.com=1678770186.20480.0000
    
    and here is the GET request that php makes with the file() command
    Code:
    GET /cdmdigitallot/vehicles.asp?dealerid=9751 HTTP/1.0
    Host: www.cdmdata.com
    
    as you can see a lot of information is missing, and apparently the ASP page doesn't like that. I'm guessing that if you used the pear HTTP_Client (or something similar) to make your own GET request that looked more like the first one, it would work.

    edit - i was going to make my own socket connection and send the request to see if it'd work, but my boss came in and gave me a task =( maybe i'll try it later.
     
    Last edited: Jun 12, 2008
  14. burn__

    yeah...i think i'm in waaaay over my head because i barely understand any of this :rofl:
     
  15. Limp_Brisket

    ok, i made my own socket request in php and sent my own GET request like this:
    Code:
    GET /cdmdigitallot/vehicles.asp?dealerid=9751 HTTP/1.1
    Host: www.cdmdata.com
    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14
    
    effectively spoofing the user-agent and it worked. i don't know if there's some way to change that with a standard php file() request or not, but if so, that'd fix that problem.
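
    for reference, the socket version was basically this (rebuilt from memory, so treat it as a sketch rather than the exact code i ran):

    Code:
    <?php
    // sketch of the raw-socket version: open a connection and write the GET by hand
    $fp = fsockopen('www.cdmdata.com', 80, $errno, $errstr, 10);
    if ($fp) {
        $request  = "GET /cdmdigitallot/vehicles.asp?dealerid=9751 HTTP/1.1\r\n";
        $request .= "Host: www.cdmdata.com\r\n";
        $request .= "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14\r\n";
        $request .= "Connection: close\r\n\r\n";
        fwrite($fp, $request);

        $response = '';
        while (!feof($fp)) {
            $response .= fgets($fp, 4096);
        }
        fclose($fp);

        // $response holds headers + body; split on the first blank line to get the HTML
        // (if the server sends it chunked you'd still have to de-chunk it)
        list(, $html) = explode("\r\n\r\n", $response, 2);
    }
    ?>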

    edit- ok, i found an easier solution than creating your own socket: you can change the user-agent setting at runtime with ini_set() so that file() works, like this:

    Code:
    ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14');
    $file = file('http://www.cdmdata.com/cdmdigitallot/vehicles.asp?dealerid=9751');
     
    Last edited: Jun 12, 2008
  16. Limp_Brisket

    here, this successfully scrapes a random page (1-9) and grabs all the cars, putting them in an array $cars.

    Code:
    <?php
        // pick a random results page so we don't always pull the same cars
        $page = mt_rand(1,9);
        // spoof a browser user-agent so the ASP page doesn't reject the request
        ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14');
        $content = file_get_contents('http://www.cdmdata.com/cdmdigitallot/vehicles.asp?dealerid=9751&Pageno='.$page);

        // non-greedy "anything" used as a spacer between the fields we want
        $s = '.+?';
        $title = '<p class="title">(.+?)</p>';
        $img = '<img src="([^"]+)';
        $stock = 'Stock#: </span>([^<]+)';
        $engine = 'Engine: </span>([^<]+)';
        $mileage = 'Mileage: </span>([^<]+)';
        $color = 'Color: </span>([^<]+)';
        $price = '<span class="price">([^<]+)</span>';

        // glue the pieces together; s = dot matches newlines, i = case-insensitive
        $regex = "~".$title.$s.$img.$s.$stock.$s.$engine.$s.$mileage.$s.$color.$s.$price."~si";

        preg_match_all($regex,$content,$cars);
        array_shift($cars);   // drop the full-match element, leaving one array per field
        print_r($cars);
        exit;
    ?>
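
    and to get your random 4-car list out of that, something along these lines would do it (rough sketch - after the array_shift, $cars[0] is titles, [1] image srcs, [2] stock, [3] engine, [4] mileage, [5] color, [6] price, and the image srcs may be relative to their site):

    Code:
    <?php
    // rough sketch: pick up to 4 random cars from the scraped page and print a simple list.
    // assumes the scrape above found at least one car.
    $picks = (array) array_rand($cars[0], min(4, count($cars[0])));

    foreach ($picks as $i) {
        echo '<div class="car">';
        echo '<img src="' . $cars[1][$i] . '" alt="' . htmlspecialchars($cars[0][$i]) . '" />';
        echo '<p>' . $cars[0][$i] . ' - ' . $cars[6][$i] . '</p>';
        echo '</div>';
    }
    ?>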
     
  17. burn__

    :eek3:
    you are the man! this is exactly what i wanted to do! i gave up on trying to do this because i had absolutely no clue :rofl:

    if you're ever in phx, i owe you a beer! :h5:
     
  18. wayno

    Limp_Brisket, nice code! Forgive the neophyte question, but does this assume there will always be nine pages of data to glean from? What if inventory is down at burn's dealership and only eight pages are available?
     
  19. Limp_Brisket

    yes it does, although it's easily changeable. i figured the only other option would be to scrape the page for the total page count and THEN scrape a random page, but that just seemed too slow. if you request a higher page number than exists, it just returns the last page anyway, so if inventory drops to 8 pages you'll simply end up with more cars showing up from the 8th page if you don't change the number.
     
  20. wayno

    Very nice. Thanks for the explanation.
     
  21. Josh

    You could also put the code into a function and have it use an expression to count the pages... return the page count and loop through 2 to $pageCount to run the function again.
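
    Something like this is what I'm picturing (rough sketch, not tested - the page-count regex is a guess since I haven't looked at how their pager is actually marked up):

    Code:
    <?php
    // sketch of the function idea: scrape one page, work out the page count, then loop.
    // scrapePage() just grabs the raw HTML for one page; you'd run the car regex
    // from the post above over the combined result. the "Page x of y" pattern is hypothetical.
    function scrapePage($page) {
        ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14');
        return file_get_contents('http://www.cdmdata.com/cdmdigitallot/vehicles.asp?dealerid=9751&Pageno=' . $page);
    }

    $html = scrapePage(1);
    $pageCount = preg_match('~Page \d+ of (\d+)~i', $html, $m) ? (int) $m[1] : 1;

    for ($p = 2; $p <= $pageCount; $p++) {
        $html .= scrapePage($p);   // append the remaining pages, then run the car regex over it all
    }
    ?>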

    Can you explain a bit on how those regular expressions work? I've always done it a bit diff / longer method...
     
  22. Limp_Brisket

    what do you mean? how regular expressions work in general, or just the way I did it? how regular expressions work could fill a small book, but i'll try to explain the one i made simply.

    $s = '.+?';
    . matches any character, + quantifies it so there's 1 or more of any character, and ? modifies the quantifier to not be greedy; in other words, this tells the regular expression to match the fewest characters possible while still allowing the whole regular expression to match successfully, otherwise it could match the whole document and that's bad. basically i use this as a spacer between the significant data i want.

    $title = '<p class="title">(.+?)</p>';
    the parentheses capture data, so this one captures the title

    $img = '<img src="([^"]+)';
    this captures the image. [^"]+ means capture 1 or more characters that are not a double quote, that way it grabs the whole image src.

    $stock = 'Stock#: </span>([^<]+)';
    grab all the information after Stock#: </span> until you hit a <

    and so on; that's the basic idea.

    then i just combined them all with the arbitrary spacing in between to make the entire regular expression. with a little more time this regular expression could be made more robust, because as it is now, if some simple formatting changed in the template it's possible it wouldn't match anymore.

    oh yeah, i forgot to mention the '~si' at the end. with perl regular expressions you can use (almost) any character to mark the beginning and end of your regexp. /'s are used most frequently, but i chose ~'s, which actually helps a lot when running regexps on HTML because you don't have to escape all the /'s found in closing tags. the 's' modifier treats the string you're matching as a single line, which means the .+ will match across newlines and returns, whereas usually it won't. the 'i' makes it case-insensitive.
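
    quick toy example of the greedy vs. non-greedy difference, if that helps:

    Code:
    <?php
    // made-up snippet with two titles in it
    $html = '<p class="title">2005 Ford F-150</p> junk <p class="title">2007 Honda Civic</p>';

    preg_match('~<p class="title">(.+)</p>~s', $html, $greedy);
    preg_match('~<p class="title">(.+?)</p>~s', $html, $lazy);

    // greedy .+ runs all the way to the LAST </p>, swallowing everything in between:
    echo $greedy[1]; // 2005 Ford F-150</p> junk <p class="title">2007 Honda Civic

    // non-greedy .+? stops at the FIRST </p>, which is what you want when scraping:
    echo $lazy[1];   // 2005 Ford F-150
    ?>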
     
    Last edited: Jun 13, 2008
