Batch Site Conversion (?)

Discussion in 'OT Technology' started by sablehonda, Jan 16, 2006.

  1. sablehonda

    sablehonda New Member

    Joined:
    Sep 21, 2002
    Messages:
    8,514
    Likes Received:
    0
    Location:
    Chicago
    I have a project to convert an existing site from a content management system to a "normal site."

    The CMS is called "Revize". It fills standard static pages with hundreds of client-side JavaScript variables, which inflates what would normally be a 15 KB page to over 100 KB.

    Here is an example of some of the code it adds to each page:
    Code:
    <img src="/revize/images/edit/new_sm.gif">
    <script language="JavaScript" type="text/JavaScript">
    RZ.module = 'about_links'; RZ.linkname = 'a_link2';
    RZ.template = ''; RZ.recordid = 'new';
    RZ.nexturl = '';
    RZ.popupwidth = ''; RZ.popupheight = ''; RZ.popupscroll = '';
    RZ.img = '<img src="/revize/images/edit/new_sm.gif" />';
    RZ.set = '';
    RZ.options = '';
    if (typeof RZaction != 'undefined') RZaction('newitem');
    var count = 0;
    </script>
    
    Basically it adds some variables to almost every object on a page.

    I have "ripped" the site and there are over 4000 pages.

    Does anyone know of a good batch method to strip these CMS JavaScript sections out?

    I was thinking of writing a Perl script to traverse the directory structure and remove the <script>...</script> sections, but I was also wondering if anyone knew of a batch tool that could go through and pull out just the legitimate HTML (no JavaScript).
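    A minimal sketch of that idea, written in Python rather than Perl. It assumes the ripped pages end in .html or .htm and that every <script> block can be dropped; the names strip_scripts and convert_tree are made up for illustration.

```python
import os
import re

# Non-greedy match of one <script ...>...</script> block.
# DOTALL lets "." span newlines, IGNORECASE also catches <SCRIPT>.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                          re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_BLOCK.sub("", html)


def convert_tree(root, suffixes=(".html", ".htm")):
    """Rewrite matching files under root in place; return how many changed."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            # latin-1 maps every byte to one character, and newline=""
            # disables newline translation, so untouched text is
            # written back byte for byte.
            with open(path, encoding="latin-1", newline="") as f:
                original = f.read()
            cleaned = strip_scripts(original)
            if cleaned != original:
                with open(path, "w", encoding="latin-1", newline="") as f:
                    f.write(cleaned)
                changed += 1
    return changed
```

    Since this rewrites files in place, it is probably safest to run it against a copy of the ripped site, e.g. convert_tree("site_copy"), and diff a few pages before trusting the result.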
     
  2. FallNAngel

    FallNAngel ...destroyer of threads...

    Joined:
    Jan 15, 2002
    Messages:
    939
    Likes Received:
    0
    Location:
    Right Here (near Chicago)
    Unfortunately, I don't know of any such tools, but writing a Perl script to remove the script content would be much easier than trying to extract and keep the HTML content. Hell, getting the Perl script to traverse the directories would be harder than ripping out <script.*</script> content. The real problem is what to do if there's script content that's actually *needed*. Keeping that content will be a nice trick.
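    One caveat on the pattern above: a greedy .* that is allowed to span lines matches from the first <script to the last </script> on the page, taking any HTML in between with it, and in Perl a plain . does not match newlines at all unless the /s modifier is used. A possible way to keep the scripts that are needed is to remove only the blocks that look like CMS output. The sketch below is in Python rather than Perl, and the marker (RZ.something assignments or an RZaction call) is an assumption based only on the sample posted earlier in the thread.

```python
import re

# One script block at a time (non-greedy); group 1 is the script body.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>(.*?)</script\s*>",
                          re.IGNORECASE | re.DOTALL)

# Assumed CMS fingerprint, taken from the posted sample only:
# "RZ.<name> =" assignments or a call to RZaction(...).
CMS_MARKER = re.compile(r"\bRZ\.\w+\s*=|\bRZaction\s*\(")


def strip_cms_scripts(html):
    """Remove only the script blocks whose body matches CMS_MARKER."""
    def replace(match):
        body = match.group(1)
        # Drop the block if it looks like CMS output, otherwise keep it as is.
        return "" if CMS_MARKER.search(body) else match.group(0)
    return SCRIPT_BLOCK.sub(replace, html)
```

    A script block that mixes CMS variables with code the site really needs would still be removed whole, so a spot check of the output is worth doing.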
     
  3. sablehonda

    Well, the only way I know of to remove <script>*</script> is with a batch Perl script. Maybe Dreamweaver could do it.

    How would you go about it?
     
  4. FallNAngel

    The only (easy) way I can think of to remove the script content is with a Perl script as well. I don't believe Dreamweaver supports what you're looking to do, and Perl is basically *made* for this kind of thing anyway.

    The only *issue* you should have with using a Perl script to remove the content is keeping the scripts that *should* be there (the ones *not* added by the CMS). The code snippet you first posted is a piece of JavaScript that initializes some variables... but that's it, it isn't actually *doing* anything... which makes me wonder what else the CMS added to the file that *does* use those variables.

    My point is: will removing the <script> tags be enough? For instance, the CMS might have added other garbage to the HTML that references those variables and functions but doesn't sit between <script> tags, such as inline JavaScript like <input type="button" onclick="somejavafunction()">. You will have removed the actual function "somejavafunction" that was in the <script> tags, but not the code that calls it... which will lead to pages generating errors and likely not working.
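    One rough way to check for that kind of leftover after the cleanup is to list functions that inline event handlers call but that no remaining script defines. This is only a regex-based sketch in Python, not a real JavaScript parser: the name find_dangling_calls and the IGNORED set are made up for illustration, and it only recognizes plain "function name(" declarations inside the same page, so functions from external .js files will show up as false positives.

```python
import re

# Inline handlers such as onclick="somejavafunction()"; group 2 is the code.
HANDLER_ATTR = re.compile(r"""\bon[a-z]+\s*=\s*(["'])(.*?)\1""",
                          re.IGNORECASE | re.DOTALL)

# A bare identifier followed by "(" is treated as a function call.
# The lookbehind skips method calls such as window.open(...).
CALL = re.compile(r"(?<![\w.$])([A-Za-z_]\w*)\s*\(")

# Plain "function name(" declarations anywhere in the page.
FUNC_DEF = re.compile(r"\bfunction\s+([A-Za-z_]\w*)\s*\(")

# Keywords and a few common browser built-ins that are not page functions.
# Illustrative only; extend as needed.
IGNORED = {"if", "for", "while", "switch", "function",
           "alert", "confirm", "prompt"}


def find_dangling_calls(html):
    """Return names called from inline handlers but not defined in the page."""
    defined = set(FUNC_DEF.findall(html))
    called = set()
    for _quote, code in HANDLER_ATTR.findall(html):
        called.update(CALL.findall(code))
    return called - defined - IGNORED
```

    Running this over each cleaned page and printing the non-empty results would give a short list of pages to inspect by hand.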
     
  5. sablehonda

    I used Dreamweaver. There is a find/replace function where you can specify a tag (<script>) and contents between the tags (RZ*) and replace with whatever you want ("").

    It worked pretty well and removed 13,000 lines of code from a couple hundred pages.
     
