Search & Replace Challenge

Discussion in 'OT Technology' started by Astro, Feb 5, 2003.

  1. Astro

    Astro Code Monkey

    Joined:
    Mar 18, 2000
    Messages:
    2,047
    Likes Received:
    0
    Location:
    Cleveland Ohio
    I'm trying to find the right tool for this job. I've got a lot of ideas, but was wondering if people might have some more ideas. Here's the problem:

    Links on a site got changed. There's 1300+ files that need to be updated. The links look something like this:

    a href="/dir1/dir2/[filename].HTML"

    The trick is I need to preserve the file name and swap out the /dir1/dir2/ and .HTML parts (yes, the ".HTML" needs to be dropped).

    But wait, that's not all!

    There's about 12 different file types ranging from HTML, PDF, GIF, JPEG, etc that need to be replaced as well. So to give an example, we want to change "/dir1/dir2/[filename].[file extension]" to "/newdir/newdir2/[filename]" (no file extension).
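
    A minimal sketch of that substitution in Python (not one of the tools discussed in this thread). The extension list is an assumption, since only some of the roughly 12 types are named here, and `rewrite_links` is a hypothetical helper name:

```python
import re

# Assumed extension list: the post names only some of the ~12 types.
EXTENSIONS = ("html", "htm", "pdf", "gif", "jpeg", "jpg")

# Capture the file name; match (but do not capture) the extension.
LINK_RE = re.compile(
    r'"/dir1/dir2/([^"/#]+)\.(?:' + "|".join(EXTENSIONS) + r')"',
    re.IGNORECASE,
)


def rewrite_links(text: str) -> str:
    """Swap the directory prefix and drop the extension."""
    return LINK_RE.sub(r'"/newdir/newdir2/\1"', text)


print(rewrite_links('<a href="/dir1/dir2/report.HTML">Report</a>'))
```

    Links that point anywhere other than /dir1/dir2/ are left untouched, because the old prefix is part of the pattern.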

    Easy eh?

    The catch is half of these files are encoded as UTF-8 with multi-byte encoding. So your run-of-the-mill app HAS to support UTF-8 (and ideally on the fly).
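
    For the files that really are UTF-8, the safe pattern is to decode and re-encode explicitly instead of relying on a platform default. A rough Python sketch, where `replace_in_utf8_file` is a hypothetical helper and the plain string replace stands in for whatever search is actually needed:

```python
def replace_in_utf8_file(path: str, old: str, new: str) -> bool:
    """Replace old with new in one UTF-8 file; return True if it changed."""
    # encoding="utf-8" decodes multi-byte sequences into whole characters;
    # newline="" leaves the file's existing line endings alone.
    with open(path, encoding="utf-8", newline="") as f:
        original = f.read()

    updated = original.replace(old, new)
    if updated == original:
        return False

    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(updated)
    return True
```

    A file that is not valid UTF-8 raises UnicodeDecodeError here instead of being silently mangled, which is one way to sort the two halves of the file set apart.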

    Solutions I've been playing with:

    --

    Edit+: This program rocks. It rocks a lot. I created a regexp search and replace which works great. Two problems:

    1. Its multi-file search/replace is a pain in the ass. It works great on one file at a time. This can be painful when there's 500+ files.

    2. It does NOT support UTF-8 two (or more) byte encoding.
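
    The multi-file part is the easy half to script. A sketch in Python of a batch driver, assuming the pages end in .htm or .html and may sit in subdirectories; `rewrite_tree` and its `transform` argument are hypothetical names:

```python
from pathlib import Path
from typing import Callable, List


def rewrite_tree(root: Path, transform: Callable[[bytes], bytes]) -> List[Path]:
    """Apply transform to every .htm/.html file under root; return changed files."""
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in (".htm", ".html"):
            continue
        original = path.read_bytes()
        updated = transform(original)
        # Only touch files whose content actually changed.
        if updated != original:
            path.write_bytes(updated)
            changed.append(path)
    return changed
```

    Working on raw bytes here means the driver itself never decodes anything, so the choice of encoding handling stays inside whatever function is passed in as `transform`.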

    --

    BBEdit: Although a Mac program, this shows the most promise. We've only done some quick tests, but it appears it may do everything we need (multiple files and regexp for the replace). Sadly, it's on a Mac, but if it gets the job done then it's cool and I'm happy.

    --

    WinGrep: This program really rocks (and is very fast) and I was able to set up a regexp that works for searches, but it doesn't do regexp for the replace string, so we're stuck. We can use this tool to find the matches, but we can't use it to do the replaces. It does handle UTF-8 files nicely.

    --

    Perl: I haven't looked into this. If it comes down to it, I'll write something to parse the files. Not sure if Perl is UTF-8 friendly, but maybe if I tell it to handle the files as binary, then life will be good.
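
    The "handle the files as binary" idea holds up because every byte of a multi-byte UTF-8 sequence has its high bit set, so an all-ASCII pattern can never match inside one. The same idea sketched in Python rather than Perl, with an assumed extension list:

```python
import re

# Bytes pattern: only ASCII is matched explicitly, so multi-byte UTF-8
# sequences elsewhere in the file pass through untouched.
LINK_RE = re.compile(
    rb'"/dir1/dir2/([^"/#]+)\.(?:html?|pdf|gif|jpe?g)"',  # assumed extensions
    re.IGNORECASE,
)


def rewrite_bytes(data: bytes) -> bytes:
    """Rewrite links in raw file content without decoding it."""
    return LINK_RE.sub(rb'"/newdir/newdir2/\1"', data)


page = '<a href="/dir1/dir2/page.HTML">日本語</a>'.encode("utf-8")
print(rewrite_bytes(page).decode("utf-8"))
```

    This also works on the non-UTF-8 half of the files, as long as their encoding is ASCII-compatible for the characters in the pattern.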

    --

    Java (JSP/Servlets): I wrote something that does this. It's not the fastest thing and it completely butchers the UTF-8 files. Maybe if I fiddle with the code to treat the files as binary it might work a little better.

    --

    Authentic Unix-flavored grep: I understand that with the right command-line options, you can get grep to do a ton of nifty stuff. I am not a master of this tool, but I understand WinGrep is based on it. The questions are whether it can handle a regexp for its replace string and whether it will destroy UTF-8 files.

    --

    Kick it old-school: I don't want to do this by hand so this is not an option.

    --

    Other thoughts or ideas to doing this?
     
  2. 5Gen_Prelude

    5Gen_Prelude There might not be an "I" in the word "Team", but

    Joined:
    Mar 14, 2000
    Messages:
    14,519
    Likes Received:
    1
    Location:
    Vancouver, BC, CANADA
    So... the files have "moved" and are in the proper format, but the HTML pages linking to the files need updating, correct? Or do the files also need to be moved? And if it's just the former, what kind of files other than text files (HTML included, of course) need to be changed?
     
  3. Astro

    Astro Code Monkey

    Joined:
    Mar 18, 2000
    Messages:
    2,047
    Likes Received:
    0
    Location:
    Cleveland Ohio
    The files have kinda moved. Basically, as far as I'm concerned they will already be in the proper location. Any file that references this old location needs to be updated (so a boatload of HTML files). But there's GIF, JPEG, SWF, PDF, DOC, PPT, XLS, and more that are linked from these HTML files. So just these links need to be updated. There's some other steps involved, but they can't really be automated. Automating the search and replace will save a ton of time...
     
  4. 5Gen_Prelude

    5Gen_Prelude There might not be an "I" in the word "Team", but

    Joined:
    Mar 14, 2000
    Messages:
    14,519
    Likes Received:
    1
    Location:
    Vancouver, BC, CANADA
    So basically search all HTML or HTM files for references to any files that reside in the olddir1/olddir2 directory? Well, myself I would do it in VB, only because that's what I know. It would be pretty quick, I think. You would have to assume some things though. For example, all references are enclosed in quotes, and all file names have a single period (i.e. no my.sample.doc.txt files). Even better if everything always says a href, but that's not necessary. The more you can assume about each reference, the easier it is to ensure it gets replaced properly.
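
    The single-period assumption can be relaxed by splitting on the last period only. A small sketch of that rule, shown in Python rather than VB; `strip_extension` is a hypothetical helper:

```python
from pathlib import PurePosixPath


def strip_extension(link: str) -> str:
    """Return the file name from a link path, minus its last extension."""
    # .stem drops only the final suffix, so earlier periods are kept.
    return PurePosixPath(link).stem


print(strip_extension("/dir1/dir2/my.sample.doc.txt"))
print(strip_extension("/dir1/dir2/report.HTML"))
```

    Whether "everything after the last period" is the right definition of an extension for this site is still an assumption that has to be checked against the real file names.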

    Out of curiosity, why is the extension dropping?
     
  5. 5Gen_Prelude

    5Gen_Prelude There might not be an "I" in the word "Team", but

    Joined:
    Mar 14, 2000
    Messages:
    14,519
    Likes Received:
    1
    Location:
    Vancouver, BC, CANADA
    Oh and are there layers of subdirectories or are all the file names in question in that one directory?
     
  6. Astro

    Astro Code Monkey

    Joined:
    Mar 18, 2000
    Messages:
    2,047
    Likes Received:
    0
    Location:
    Cleveland Ohio
    Actually, I left out a detail:

    Some links may be like:

    href="/dir1/dir2/[filename].[fileextension]#[internal link]"

    In this case the #[internal link] MUST be kept (otherwise it makes a really big mess and a lot of hand coding, which would defeat half the purpose of this).

    The reason for dropping the extension is that there's going to be a Java servlet which just needs the file name but not the extension. It will be able to figure out what it needs to do based on the file name.
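
    One way to keep the internal link is to give it its own optional capture group and write it back after the file name. A sketch in Python, with an assumed extension list and a hypothetical `rewrite_href` helper:

```python
import re

LINK_RE = re.compile(
    r'href="/dir1/dir2/([^"/#]+)'      # group 1: file name
    r'\.(?:html?|pdf|gif|jpe?g)'       # extension to drop (assumed list)
    r'(#[^"]*)?"',                     # group 2: optional #internal-link
    re.IGNORECASE,
)


def rewrite_href(text: str) -> str:
    """Rewrite the link, carrying any #fragment over unchanged."""
    def repl(match: "re.Match[str]") -> str:
        fragment = match.group(2) or ""
        return f'href="/newdir/newdir2/{match.group(1)}{fragment}"'

    return LINK_RE.sub(repl, text)


print(rewrite_href('<a href="/dir1/dir2/guide.html#install">Install</a>'))
```

    Using a replacement function instead of a template string makes the "no fragment" case explicit: group 2 is None there and is replaced with an empty string.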

    With 1300+ HTML files, anything can go for links, but they appear pretty standard (quotes are in the proper spot, etc).

    The method you described isn't going to be flexible enough. As you mentioned, there's some assumptions that are taken with your approach and that can be dangerous depending on the circumstances.

    I'm also assuming your approach would be based off of reading the file in line by line and then doing a string match. Then using VB magic it would be a matter of substring-ing the input line and putting the line back together and then outputting it.

    I've worked with VB. Its string handling is functional, but cumbersome and not overly powerful. From where I sit, VB could get the job done, but it's not a tool of my choice.

    What might help is that this will only be a one-time shot (we hope). If this was going to be a recurring job, then I'd give VB stronger consideration (it's easy to create a simple interface for the user).
     
  7. Astro

    Astro Code Monkey

    Joined:
    Mar 18, 2000
    Messages:
    2,047
    Likes Received:
    0
    Location:
    Cleveland Ohio
    All in the same directory, which is nice for this exercise, although not a way I would go with my personal sites.
     
    Last edited: Feb 6, 2003
  8. 5Gen_Prelude

    5Gen_Prelude There might not be an "I" in the word "Team", but

    Joined:
    Mar 14, 2000
    Messages:
    14,519
    Likes Received:
    1
    Location:
    Vancouver, BC, CANADA
    I'm not sure why it wouldn't be flexible - I would think you would have the same problem regardless of which language you choose. Unless you provide a list of extensions prior to the script running, you have to come up with some rules of what is an extension and what isn't. In other words, the same logic you would use to do it by hand, must be implemented regardless of which tools you use.
     
  9. Astro

    Astro Code Monkey

    Joined:
    Mar 18, 2000
    Messages:
    2,047
    Likes Received:
    0
    Location:
    Cleveland Ohio
    I'm not sure where you were going with this. I don't have any problems dropping the extension even though it may have a variable length and changes. The problem is finding a tool I can use to reliably execute the regular expression and preserve the UTF-8 encoding. With the regular expression I put together, I am defining the possible extensions with an "or" only because the parser was getting greedy and didn't know when to stop at "#" or " (maybe if I had more time I could hammer it out, but I decided to specify the extension instead). So this part of the equation is not an issue.
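
    A common cause of that kind of greediness is a `.+` or `.*` in the pattern, and a common fix is a character class that excludes the stopping characters. The sketch below illustrates the difference in Python; the patterns are illustrative guesses, not the expression actually used here:

```python
import re

line = '<a href="/dir1/dir2/a.html#top">A</a> <a href="/dir1/dir2/b.pdf">B</a>'

# Greedy: ".+" runs to the end of the line, then backs up to the LAST period.
greedy = re.compile(r'href="/dir1/dir2/(.+)\.\w+')

# Bounded: the file name may not contain a quote or "#", so it stops in time.
bounded = re.compile(r'href="/dir1/dir2/([^"#]+)\.\w+')

greedy_names = greedy.findall(line)
bounded_names = bounded.findall(line)

print(greedy_names)   # one over-long match spanning both links
print(bounded_names)  # one short match per link
```

    Listing the extensions explicitly, as described above, is the stricter option: it also refuses to touch links whose extension is not on the list.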

    In reference to VB, I'm being picky and do not want to use that as a tool (also our shop does not have it).

    Its string handling is functional but painful (this is my opinion after doing string handling with ASP JScript, JavaScript, PHP, Perl, JSP, and others). Assembler and C/C++ (console flavored) are not my tool choices either. Their string handling is either basic or completely do-it-yourself (makes VB look pretty good).
     
  10. 5Gen_Prelude

    5Gen_Prelude There might not be an "I" in the word "Team", but

    Joined:
    Mar 14, 2000
    Messages:
    14,519
    Likes Received:
    1
    Location:
    Vancouver, BC, CANADA
    If you have a copy of Office, you have a copy of VB.
     
  11. Kabuko

    Kabuko Guest

    I would use VB... string handling in VB is just fine functionality-wise. It has some nice functions to use. The biggest problem with VB strings is performance, not ease of programming. 1300 filenames would be quick to process in any language. Even if you don't have Office, you can write a VBScript and use the WScript Host to run it. Can't use regular expressions (as far as I know) with VB, but it's a rather simple filename parse, so I don't think it's necessary.
     
  12. Astro

    Astro Code Monkey

    Joined:
    Mar 18, 2000
    Messages:
    2,047
    Likes Received:
    0
    Location:
    Cleveland Ohio
    I've seen some documentation that suggests regular expressions can be done in VB.

    Parsing without using regular expressions can be done, but the regular expression itself is not that difficult to assemble. From a programming standpoint, the code required to execute the regular expression is a bit less than trying to code in all the rules necessary (and can you say recursion? Multiple links can and do occur on a single line).
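
    With a regex engine, several links on one line need no recursion or manual looping, since a substitute call replaces every non-overlapping match. A Python sketch using `re.subn`, which also returns the number of replacements made (handy as a sanity check across 1300+ files); the extension list is assumed:

```python
import re

LINK_RE = re.compile(
    r'"/dir1/dir2/([^"/#]+)\.(?:html?|pdf|gif|jpe?g)((?:#[^"]*)?)"',
    re.IGNORECASE,
)

line = ('<a href="/dir1/dir2/a.html">A</a> | '
        '<a href="/dir1/dir2/b.PDF">B</a> | '
        '<a href="/dir1/dir2/c.htm#top">C</a>')

# Group 2 always participates (possibly empty), so \2 is safe in the template.
new_line, count = LINK_RE.subn(r'"/newdir/newdir2/\1\2"', line)

print(count)
print(new_line)
```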

    Why reinvent the wheel when I found this:

    http://www.divlocsoft.com/

    It did EXACTLY what I needed:

    - regular expression search
    - regular expression replace
    - kept the UTF-8/Unicode integrity - at least for what I needed. I'm curious if VB's string handling would have preserved the Unicode or if it would have choked or mutilated it.

    So, the job got done pretty quickly. It took a little bit of time to fine-tune the regular expression search string. I found there were a couple of additional catches involved that I wasn't aware of, but in the end it was all good.
     
