Currently doing the scariest thing I've ever had to do at work

Discussion in 'OT Technology' started by Mike99TA, May 18, 2008.

  1. Mike99TA

    Mike99TA I don't have anything clever to put here right now

    Joined:
    Oct 3, 2001
    Messages:
    4,553
    Likes Received:
    0
    Location:
    Greenville, SC
    So I'm at home, doing some planned work this weekend, and it's probably the most stressful thing I've had to do (that was planned; obviously unplanned outages can be very stressful).

    We have a 1TB SAP instance running on our XP storage array on a SLES10SP1 server. We need to migrate the data from LVM to Storage Foundation Disk groups/vxfs partitions to introduce the system into our new backup infrastructure.

    Our XP array is diced into 20GB LUNs, so this basically involved creating a Business Copy of 50 LUNs, copying the data, splitting it off, installing Storage Foundation, blowing away all the existing SAP/Oracle data, reinitializing the disks with Storage Foundation, zoning the copied disks into the server, bringing the old data online under different mount points, and copying all the data to the new disks. Of course there's a ton of extra steps in there I didn't list (like validating the data on the copied disks before wiping the originals, etc.), but that's the short of it anyway. Oh, and I forgot to mention, there's no current backup of the instance because it would take around 4 hours to back it up and we don't have that long of a maintenance window, so the only copies of the data are the disk copy and the NetBackup from last weekend (which would hardly even be usable).
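
    The "validate the copied disks before wiping the originals" step could be sketched roughly like this. This is only an illustration, assuming both the original and the copied filesystems are mounted and the database is shut down; the function names and the idea of comparing SHA-256 checksums file by file are assumptions, not the procedure actually used here.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large datafiles never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original_root, copy_root):
    """Return (relative_path, problem) tuples for files missing from or differing in the copy."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(original_root):
        for name in filenames:
            original = os.path.join(dirpath, name)
            relative = os.path.relpath(original, original_root)
            copied = os.path.join(copy_root, relative)
            if not os.path.isfile(copied):
                problems.append((relative, "missing"))
            elif sha256_of(original) != sha256_of(copied):
                problems.append((relative, "differs"))
    return sorted(problems)
```

    An empty list from `compare_trees` would be the signal that it is safe to move on; anything else means stop before touching the originals.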

    I about threw up when I had to wipe the data off the original disks, as the maintenance window only runs until 2am, at which point everything has to be back online or we start losing gobs and gobs of money (as I've stated here before, something to the tune of $45,000 every 118 seconds).
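
    To put that rate in perspective, here is a back-of-the-envelope sketch. It assumes the loss accrues linearly from the moment the window closes, which is a simplification; the function name is made up for illustration.

```python
LOSS_DOLLARS = 45_000          # figure quoted above
LOSS_INTERVAL_SECONDS = 118    # figure quoted above


def overrun_cost(seconds_late):
    """Estimated loss in dollars for running `seconds_late` past the window (linear assumption)."""
    if seconds_late < 0:
        raise ValueError("seconds_late must not be negative")
    return LOSS_DOLLARS * seconds_late / LOSS_INTERVAL_SECONDS
```

    Under that assumption, `overrun_cost(3600)` comes out to roughly $1.37 million for a one-hour overrun.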

    Fortunately, everything has been going smoothly and the copy is taking place now, but still... ugh.

    What have any of you done at work that was dangerous enough that it made you feel sick?
     
  2. P07r0457

    P07r0457 New Member

    Joined:
    Sep 20, 2004
    Messages:
    28,491
    Likes Received:
    0
    Location:
    Southern Oregon
    i had some bad chinese food once.

    but the good news is i saved a buttload on my car insurance!
     
  3. trouphaz

    trouphaz New Member

    Joined:
    Sep 22, 2003
    Messages:
    2,666
    Likes Received:
    0
    lol, man, storage is always more stressful than servers. i've done very similar shit to what you've done, though luckily never possibly affecting $$$. instead, i just had a somewhat overzealous director who was certain someone was going to be fired if something bad happened.

    oh, and one thing to keep in mind with business copy/continuous access/trucopy/shadow image (whatever you want to call it, any replication where you are controlling with that shitty HORCM software).... it is possible to have some of the disks in a group copying one way and others copying the other. that is how i caused my most major blunder where we had 2 servers in a BC relationship so they could do maintenance on one and then copy it over to the other (it was an app with a shitty database that couldn't be clustered). in my panic and worry that i was going to do a pvol -> svol copy instead of vice versa, i accidentally ran the command the wrong way and immediately hit ctrl-c (the interactive option didn't work as i expected, i thought it would verify what exactly i was going to do before it did it). in doing that i got the first few disks copying one way. i then ran the correct command which got the rest doing a proper restore. since we had striped at the host level using HP's LVM, the entire 4TB of data was rendered worthless. luckily we had a CA copy on a 3rd server for backups.
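
    A guard against that kind of mixed-direction group could look something like the sketch below. It does not call HORCM at all: it assumes the pair states have already been collected into a list of dicts, and the `ldev` and `direction` keys and the direction labels are hypothetical names chosen for illustration.

```python
VALID_DIRECTIONS = ("pvol_to_svol", "svol_to_pvol")


def check_copy_direction(pairs, expected):
    """Raise if any pair in the group is copying in a direction other than `expected`.

    Returns the number of pairs checked when the whole group agrees.
    """
    if expected not in VALID_DIRECTIONS:
        raise ValueError(f"unknown direction: {expected!r}")
    wrong = [pair["ldev"] for pair in pairs if pair["direction"] != expected]
    if wrong:
        raise RuntimeError(
            "mixed copy directions, refusing to continue; offending LDEVs: "
            + ", ".join(sorted(wrong))
        )
    return len(pairs)
```

    With host-level striping, a single pair going the wrong way is enough to ruin the whole volume group, so the check fails on the first disagreement rather than reporting a majority.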
     
  4. Sexual Vanilla

    Sexual Vanilla New Member

    Joined:
    May 23, 2005
    Messages:
    6,305
    Likes Received:
    0
    Location:
    South Carolina
    Any time I get called to HR :wtc:
     
  5. Mike99TA

    Mike99TA I don't have anything clever to put here right now

    Joined:
    Oct 3, 2001
    Messages:
    4,553
    Likes Received:
    0
    Location:
    Greenville, SC
    That sucks. We're protected quite nicely from that, as we generate all our HORCM files from scripts that use our master array configuration file (CU:LDEV list of disks with the VG they're part of, and the application, one file for each array). Basically when we secure LUNs and mark them off in our master config files, we run a create_horcm script that parses the config file, and then manually creates a horcm file with the from->to LUNs. We also only synchronize our disks in a single direction, ever. We have protection implemented for that as well. If we ever need to revert data we usually do some sort of a host-based copy, or we secure both sets of disks to a single server, etc. Been using BC/CA for ~8 years or so and have never had an incident (yet). Also our CAs are set so it's not even physically possible to sync the other direction (the ports themselves are RO on the XP).
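
    The general idea of generating the pair list from a master file can be sketched as follows. This is not our create_horcm script and it does not emit real HORCM syntax; the four-column line format (source CU:LDEV, target CU:LDEV, VG, application) and the output format are assumptions made up for the example.

```python
def parse_master_config(text):
    """Parse lines of 'source_ldev target_ldev vg app'; '#' starts a comment."""
    pairs = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {line_number}: expected 4 fields, got {len(fields)}")
        source, target, vg, app = fields
        pairs.append({"source": source, "target": target, "vg": vg, "app": app})
    return pairs


def build_pair_list(pairs):
    """Emit one 'group device source -> target' line per pair, enforcing one-way copies."""
    sources = {pair["source"] for pair in pairs}
    targets = [pair["target"] for pair in pairs]
    both_ways = sources & set(targets)
    if both_ways:
        raise ValueError(f"LDEVs used as both source and target: {sorted(both_ways)}")
    if len(set(targets)) != len(targets):
        raise ValueError("a target LDEV appears in more than one pair")
    return [
        f"{pair['vg']} {pair['vg']}_{index:03d} {pair['source']} -> {pair['target']}"
        for index, pair in enumerate(pairs)
    ]
```

    Because the direction lives in the master file and nowhere else, nobody has to type a from->to mapping by hand at 3am.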

    I'd shit a brick if something like that actually happened though.
     
  6. deusexaethera

    deusexaethera OT Supporter

    Joined:
    Jan 27, 2005
    Messages:
    19,712
    Likes Received:
    0
    Never had to deal with anything that bad. Though, I did once accidentally reformat the domain controller for an old domain that belonged to a company we had recently bought. That was an uncomfortable meeting in the VP's office.
     
  7. trouphaz

    trouphaz New Member

    Joined:
    Sep 22, 2003
    Messages:
    2,666
    Likes Received:
    0
    wow, you never restore? that incident i mentioned did end up being a nightmare because our XPs were set up like you were saying. we used CA for backups (we had 2 XPs in the same room since we got one for free) and the guy who set it up configured 4 ports all going from one array to the other with none set for reverse and it took HP a while to get them reconfigured at 3am so we could recover. with 4TB of data, it would've taken forever to manually copy the files.

    our HORCM configs were all fine. it was just a stupid mistake on my part. we used to sync pvol -> svol weekly. clients would go to svol server while the app team did maintenance on the pvol server which locked the DB for a few hours. once it was done, instead of redoing the work on the other side they just switched users over and synced the BCs in the background. due to the shittiness of the database, every so often one of them would get corrupted and the repair tools were made for a much smaller data set. i think they figured it would take 2 weeks to actually scan the database to repair it, offline with no data loading the entire time. so, keeping 2 copies was easiest. pvol system got corrupted once so i had to do a restore. it was when we first started using it, so it was my first restore and they woke me up at 3am to do it. you can bet that after that i double and triple checked every single command before i hit enter. :)
     
  8. Mike99TA

    Mike99TA I don't have anything clever to put here right now

    Joined:
    Oct 3, 2001
    Messages:
    4,553
    Likes Received:
    0
    Location:
    Greenville, SC
    I feel your pain. Never had that happen with SAN storage before but certainly have made mistakes in the middle of the night before. The way we use our Business Copies is a little bit out of the norm (we've been told by HP that we're the only customer they know of doing what we're doing) and we use our CAs mostly as a DR copy - they're secured to a server at a DR site. The few times we've had to restore data from BC disks have been on HP-UX servers, where we can secure the BC disks right to the server accessing the original disks without messing up any LVM information (man I love HP's ioscan and LVM implementation compared to Linux). With Linux this wouldn't be the case, so it's possible in the future we might have to do restores with the business copies...hopefully no mistakes then :big grin:
     
  9. trouphaz

    trouphaz New Member

    Joined:
    Sep 22, 2003
    Messages:
    2,666
    Likes Received:
    0
    what exactly are you doing with your BCs that is so unique?

    i miss ioscan and LVM now that i'm stuck working with Solaris and VxVM. i don't mind VxVM so much, but i'm nowhere near as fluent as i am with LVM and i hate having to rely on a gui.
     
  10. Mike99TA

    Mike99TA I don't have anything clever to put here right now

    Joined:
    Oct 3, 2001
    Messages:
    4,553
    Likes Received:
    0
    Location:
    Greenville, SC
    I do like VxVM but it still has the same issue (in regards to Business Copies) that LVM in Linux has and that LVM in HP-UX does not have. Since LVM on HP-UX does not automatically scan disks for VGs and import them like Linux does, we can secure 2 identical sets of disks to the same system, and use our master XP configuration file to only import a specific copy of the VG using the specific LUNs that belong to the master or svol copy of that VG. It's very nice and much safer than Linux, where if you secure 2 sets of the same disks to it, it will automatically try to import the VG and probably mangle the entire thing as it will most likely use some disks from each copy, or try to multipath disks that aren't actually the same device, or who knows what other problems you could have.
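
    Picking out "only the LUNs for one copy of a VG" from a master file is the easy part, and can be sketched like this. The record layout (`ldev`, `vg`, `copy`) is an assumption for illustration, not the real format of our configuration file, and the actual import command is deliberately left out.

```python
VALID_COPIES = ("master", "svol")


def luns_for_copy(records, vg, copy):
    """Return the sorted LDEVs belonging to exactly one copy of a volume group."""
    if copy not in VALID_COPIES:
        raise ValueError(f"copy must be one of {VALID_COPIES}, got {copy!r}")
    selected = sorted(
        record["ldev"]
        for record in records
        if record["vg"] == vg and record["copy"] == copy
    )
    if not selected:
        raise LookupError(f"no {copy} LUNs recorded for volume group {vg}")
    return selected
```

    The resulting list is what would be handed to an explicit import, so disks from the other copy are never candidates in the first place.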

    As far as Business Copies go, we're not so much doing unique business copy setups as much as we are using the Business Copies for a unique purpose. We have a very large completely homegrown script (actually 2, one for HP-UX in ksh and one for Linux in Perl) that we've written that runs for all of our production critical databases (from 100GB all the way up to 4TB Oracle DBs).

    It's kind of long to type out but basically the script runs through a sequence of events for multiple sets of business copies. First it establishes a Business Copy of a DB, then it sets the DB on the prod server to hot backup mode, then it splits the business copy, turns off hot backup mode, waits for a while, resyncs just the LUNs that hold the archive log data, splits them back off, mounts up all the filesystems, replays logs, verifies the database in read-only mode, runs a backup of the business copy data mounted up on the server, deletes all the archive logs that were applied on the business copy data from the production server, and then starts the whole process over. Now imagine 2 of those sequences going for each production database (staggered halfway so that when one is establishing the business copy the other one is applying archive logs and backing up a separate business copy hanging off the CA at a remote site), going 24x7 nonstop. Basically we're syncing and splitting hundreds and hundreds (probably thousands) of LUNs constantly throughout the day (on average most of the databases get backed up 1-2 times a day).

    It's nice because we can back up all of our databases without taking the prod DB down and it's much, much faster than RMAN or anything like that (and does not affect the performance of the prod DB or prod DB server at all), plus we can stop the processes at various steps to "hold" the business copy database in a specific state (i.e. leave it established, or let it apply logs up to a certain date and then stop, leaving the copied DB at a spot in time for restore purposes if needed, etc).
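
    The shape of one cycle, including the ability to "hold" after a given step, can be sketched as below. This is a toy outline and not the real ksh/Perl scripts: the step names are made up for illustration, and each step is just a callable supplied by the caller, so no array or database commands are shown.

```python
CYCLE_STEPS = [
    "establish_business_copy",
    "begin_hot_backup",
    "split_business_copy",
    "end_hot_backup",
    "wait_for_archive_logs",
    "resync_archive_log_luns",
    "split_archive_log_luns",
    "mount_filesystems",
    "replay_logs",
    "verify_read_only",
    "backup_copy",
    "delete_applied_archive_logs",
]


def run_cycle(actions, hold_after=None):
    """Run the steps in order and return the names of the steps that completed.

    `actions` maps each step name to a zero-argument callable. If `hold_after`
    names a step, the cycle stops once that step finishes, leaving the copy in
    that state. An exception raised by any step aborts the cycle immediately.
    """
    if hold_after is not None and hold_after not in CYCLE_STEPS:
        raise ValueError(f"unknown step: {hold_after!r}")
    missing = [step for step in CYCLE_STEPS if step not in actions]
    if missing:
        raise KeyError(f"no action supplied for: {missing}")
    completed = []
    for step in CYCLE_STEPS:
        actions[step]()
        completed.append(step)
        if step == hold_after:
            break
    return completed
```

    For example, `run_cycle(actions, hold_after="establish_business_copy")` would leave the copy established and go no further, while calling it with no hold runs the full sequence once; the real thing simply loops.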
     