URL regex question

Discussion in 'OT Technology' started by beez, Mar 8, 2008.

  1. beez

    beez New Member

    Joined:
    Jun 3, 2004
    Messages:
    19,143
    Likes Received:
    0
    Location:
    Queens
    I've been banging my head against the wall on this one all night. What would the regex be to match "http://www.example.com/lol?d=128218" but exclude "http://www.example.com/lol?d=128218&e=0%12321"? Basically, I just want to disqualify the string if there's an ampersand in it.
     
  2. whup

    whup I wish you had children and.. so that I could step

    Joined:
    Feb 12, 2007
    Messages:
    1,603
    Likes Received:
    0
    ^http:\/\/www\.example\.com\/lol\?d=(\d+)$

    That will also match any number for the d parameter. Not sure what you really want to do here.
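
    A minimal Python sketch of how that pattern behaves on the two URLs from the first post. The `$` anchor is what rejects the second URL, because `&e=...` follows the digits. The name `D_ONLY` is made up for illustration, and the forward slashes are left unescaped since Python's `re` module does not need them escaped.

```python
import re

# Hypothetical name; same pattern as above, minus the optional slash escapes.
D_ONLY = re.compile(r"^http://www\.example\.com/lol\?d=(\d+)$")

good = "http://www.example.com/lol?d=128218"
bad = "http://www.example.com/lol?d=128218&e=0%12321"

print(bool(D_ONLY.match(good)))  # True
print(bool(D_ONLY.match(bad)))   # False: "&e=..." sits between the digits and $
```

    If the only goal is "no ampersand anywhere", a plain substring check such as `"&" not in url` would probably do the job without a regex at all.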
     
  3. beez

    I'm working with a Google Search Appliance that is picking up a bunch of crazy URLs. Basically it's appending ?page=0%2C33, ?page=0%2C5, etc. to valid URLs. I'm not sure why this is happening but I think I really just want to ignore any URL where a 0 is followed by a % sign.
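
    For illustration, a rough Python sketch of the "0 followed by %" filter. This is not Google Search Appliance configuration syntax, just the same idea expressed with Python's `re` module; `ZERO_PERCENT` and `keep_url` are hypothetical names. Note that `%2C` is a URL-encoded comma, so `0%2C33` decodes to `0,33`.

```python
import re

# Hypothetical name: matches a literal "0" immediately followed by "%".
ZERO_PERCENT = re.compile(r"0%")


def keep_url(url: str) -> bool:
    """Return True when the URL contains no '0' directly followed by '%'."""
    return ZERO_PERCENT.search(url) is None


urls = [
    "http://www.example.com/lol",
    "http://www.example.com/lol?page=0%2C33",
    "http://www.example.com/lol?page=0%2C5",
]

kept = [u for u in urls if keep_url(u)]
print(kept)  # ['http://www.example.com/lol']
```

    A filter this broad would also drop legitimate URLs that happen to contain something like `10%25`, so it is a blunt instrument.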
     
  4. beez

    I totally figured it out. There is an element inside each page that reflects the URL being requested by the browser, and this is fooling the GSA into thinking the pages are unique. It's something to fix on our side rather than something to screw with via regex. I'd still be interested in figuring out how to ignore URLs where the value after page= is above some threshold. Something like http:\/\/www\.example\.com\/lol\/.*\?page=\d+$ maybe, though that matches any numeric page value rather than only values above a threshold, so I'm not sure it would work, but there you go.
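
    Regexes are awkward for numeric comparisons, so one possible approach is to parse the query string and compare the number directly. A minimal Python sketch, assuming the page value is a plain integer; `MAX_PAGE` and `page_over_threshold` are hypothetical names and the threshold of 50 is arbitrary.

```python
from urllib.parse import parse_qs, urlsplit

MAX_PAGE = 50  # hypothetical threshold, pick whatever fits the site


def page_over_threshold(url: str, max_page: int = MAX_PAGE) -> bool:
    """Return True when the URL has an integer page= value above max_page."""
    query = parse_qs(urlsplit(url).query)
    for value in query.get("page", []):
        try:
            if int(value) > max_page:
                return True
        except ValueError:
            # Non-integer values such as "0,33" (decoded from 0%2C33) are skipped.
            continue
    return False


print(page_over_threshold("http://www.example.com/lol/?page=51"))  # True
print(page_over_threshold("http://www.example.com/lol/?page=50"))  # False
```

    If it has to stay a regex, a digit-count threshold is the easy case: `\?page=\d{3,}$` matches page values with three or more digits, which means 100 and up as long as there are no leading zeros.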
     
