Building simplicidade.org: notes, projects, and occasional rants

Recover day

Yesterday, Feb 18 2005, I had a special day.

I arrived at work in the morning and was catching up on email and RSS and reviewing my task list when I got a call from my wife.

We have an e-learning site in Portugal, where we sell online courses. We have several different sites selling those courses, some of them affiliated with portals or job-offer sites.

So this site was down, had been for a couple of hours.

First mental note of the day: when your previous monitoring solution is totally trashed, don’t take more than a week to set up something else. The previous spong-based monitoring system had been dead for quite some time, and the new Nagios-based one didn’t monitor this site yet: I had only finished installing it the previous day. So Murphy was clearly awake and at work.

The data-center where we have the server has all the redundant stuff one would like (power, network, the works), and I could log on to the back-end server, where the database was located. Only the front-end server, where all the sites were, was down. So I could rule out external factors.

The servers are connected to an IP KVM from Dell, so I started up the Java console and selected the front-end server. The message was clear: blah blah blah kernel panic blah blah scsi blah error blah blah you are toast blah blah.

Yep, this is going to be a fun day.

I go down to the data-center and reboot it. The system is able to fsck the boot disk (IDE disk), but the SCSI data disk is a no go.

So now, I switch to the back-end server and see if my daily backups (rsync with snapshots) are OK. The file count seems right, the logs are ok, no errors reported, so we are in good shape. Or maybe not.

Second mental note of the day: whenever you add a new software server to your solution, make sure your backups cover it.

Some months ago, I added a new web server to the mix, and the config file took some work to fine-tune. And no, it was not in the backups.
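A minimal sketch of the kind of check that would have caught this: list the paths you cannot afford to lose and ask the latest snapshot whether it has them. The helper name missing_from_backup is made up for illustration, and it assumes the snapshot mirrors the live tree starting at "/"; neither detail comes from the setup described here.

```shell
#!/bin/sh
# Print every expected path that is absent from a backup snapshot.
# Usage: missing_from_backup SNAPSHOT_DIR ABSOLUTE_PATH...
# (hypothetical helper; assumes the snapshot mirrors the live tree from "/")
missing_from_backup() {
    snapshot=$1
    shift
    for path in "$@"; do
        # A live path such as /etc/foo.conf is expected at SNAPSHOT_DIR/etc/foo.conf.
        [ -e "$snapshot$path" ] || echo "$path"
    done
}
```

Called after each backup run with the snapshot directory and the config files of every service on the box, it prints nothing when everything is covered and one line per forgotten file otherwise.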

So recovering the disk is now very important to me.

First, get fresh media. I went out and bought two 80 Gig IDE disk drives. I could use one right now, and then add the second one in a RAID-1 setup.

I take the two disks to the data-center, power down, connect one of them to the server and boot up. Time to do some brain surgery. I decided to move all the data from the boot disk and the data disk to this new one. An old 8GB IDE and a 9GB SCSI should fit into an 80GB drive, right?

Well, wrong.

Although the motherboard and BIOS could see all 80GB, Linux could not: it saw only 10GB worth. This is Red Hat 7.2, and there is probably some parameter to turn on LBA support or something, but I didn’t have that kind of time, so I just used the jumpers to limit the disk to 32GB.

Reboot again, and now I can see the 32GB.

Time to get the tools. A friend had used dd_rescue with great success so I went with that.

I also needed a rescue disk. The server did not have a CD drive, so it had to be a floppy. I decided to use Tom’s rescue disk. I copied dd_rescue to another floppy.

So boot from the rescue floppy, partition the new disk in a similar way to the others, mount it, mount the old IDE disk, and tar the OS over to the new disk. Edit fstab and make sure everything is OK. So far so good.

Now start dd_rescue, copying /dev/sda1 to a file on the new disk. Later we will run fsck on it, and mount it with a loop device.
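As a rough sketch of that step: dd_rescue takes the source and the destination as its two positional arguments. The image path below is an assumption (the post only says "a file" on the new disk), and run is a hypothetical wrapper that prints each command unless DRY_RUN=0, so nothing touches a real device by accident.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Copy a failing partition into an image file on a healthy disk.
# $1: source device, $2: image file (hypothetical helper name).
image_partition() {
    run dd_rescue "$1" "$2"
}

# /mnt/new/sda1.img is an assumed path; only /dev/sda1 comes from the post.
image_partition /dev/sda1 /mnt/new/sda1.img
```

Working on an image instead of the dying disk means fsck can be as destructive as it likes later on: the original is only ever read.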

Invalid instruction.

OK, this is a P3 server, and I compiled dd_rescue on a P4, so my bad. Recompile on a P3. Copy to floppy, copy from floppy. Sneakernet at its best.

Run it again, and it starts to do its job. It should take a little time…

… and it’s over. My 9GB disk drive is now in a file on the new drive, with only 6KB missing. Wow, that’s great! Let me just confirm all the files are OK…

ls -la

Hmm… that sda1.img is smaller than I thought… 2147483648 bytes? That number looks familiar… Doh! 2GB limit! Arrgs. The rescue floppy is a 2.0 kernel…
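Why the number looks familiar: 2147483648 is exactly 2^31, the point where a signed 32-bit file offset runs out. The sketch below just does that arithmetic, assuming the cutoff really came from a 32-bit offset (the usual cause on kernels and C libraries of that era). LIMIT_2GIB and at_2gib_limit are illustrative names, not part of any tool.

```shell
#!/bin/sh
# 2^31 bytes: where a signed 32-bit file offset runs out.
LIMIT_2GIB=$((1 << 31))

# Succeeds when SIZE (in bytes) sits right at that boundary, which hints at
# a truncated copy rather than a finished one. Hypothetical helper.
at_2gib_limit() {
    [ "$1" -ge "$((LIMIT_2GIB - 1))" ] && [ "$1" -le "$LIMIT_2GIB" ]
}

echo "$LIMIT_2GIB"                           # prints 2147483648
echo "$((LIMIT_2GIB / 1024 / 1024 / 1024))"  # prints 2

if at_2gib_limit 2147483648; then
    echo "image stopped at the 2 GiB boundary"
fi
```

A 9GB partition that produces an image of this exact size has almost certainly been cut off, whatever the copy tool reported.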

So now I need a decent rescue alternative. Well, the OS is already on the new disk, so let’s boot from that one and finish the job.

Disconnect old drive, connect new drive with OS as primary, keep SCSI connected. Boot from rescue disk, mount root filesystem on new drive, chroot to it, run lilo, remove floppy, reboot.

The system starts cleanly, cool. So the first tar part went very well, and my fstab editing was also good. Well, thanks for that.

By now, I had found a faster alternative to dd_rescue, a shell-based front-end named dd_rhelp.

So I compiled dd_rescue. Went well. Then dd_rhelp. Nooo, you need a newer version of autoconf! Download autoconf, configure, make, make install. Recompile. Success.

Now try to run dd_rhelp. Syntax errors all over… So, like, your shell, you know, it’s too old… wget bash, configure, yada yada yada…

I run it again and it works! And it should be faster. Also, it has prettier graphics… :)

ls -la

Something about a value too big for something else… du cannot see the file either.

So this glibc can create the file, but you cannot stat it, and you cannot mount it…

Let me recap: we now have a 9GB file on the new disk with almost all of the content from the old disk, but I cannot work with it.

And I have an extra 80GB disk drive next to me, free.

Get the Fedora network install CDs, get another computer with a CD drive, plug in the second drive, insert CD, answer a couple of questions, and now we have a bootable Fedora Core 3 server install, our new rescue disk.

Connect new Fedora as boot device, move the new “32GB” disk to the other IDE channel, boot.

Ahhs, there it is, my 9GB image is finally visible… losetup and fsck /dev/loop0. It goes and does its thing, and reports 6 (6!) files missing, and those files are generated files, rsync’ed from the back-office. So no data loss at all. First good news (after 8 hours of work).

Mount loopback device, and rsync the data over. Power down, remove Fedora, reconnect main disk to primary channel.
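The losetup, fsck, mount and rsync sequence can be sketched as below. Only sda1.img and /dev/loop0 come from the post; the mount point and destination directory are placeholders, and run is a hypothetical wrapper that prints commands unless DRY_RUN=0. Note that fsck exits with 1 when it found and fixed errors, so the sketch treats only 4 and above as failure.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: image file, $2: loop device, $3: mount point, $4: destination directory
restore_from_image() {
    image=$1 loopdev=$2 mountpoint=$3 dest=$4
    # Attach the image to a loop device so it looks like a block device.
    run losetup "$loopdev" "$image" || return 1
    # Repair the filesystem inside the image; exit codes below 4 are acceptable.
    status=0
    run fsck "$loopdev" || status=$?
    [ "$status" -lt 4 ] || return "$status"
    run mount "$loopdev" "$mountpoint" || return 1
    # Trailing slashes copy the directory contents, not the directory itself.
    run rsync -a "$mountpoint/" "$dest/"
}

# /mnt/rescued and /data are assumed names for illustration.
restore_from_image sda1.img /dev/loop0 /mnt/rescued /data
```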

Reboot.

All is well, the sites come back alive. It’s now 11pm, some 10 hours later. I’ll clean up the war zone tomorrow (today).


At least the following songs were played while writing this. No animals were injured, though. They run too fast.

Mother of a Girl from the album “3” by Violent Femmes

Memo from Turner from the album “A Greatest Hits Unplugged Tribute” by The Rolling Clones featuring Jimmy Lee

No Sexy from the album “For Every Punk Bitch & Arsehole” by Hang On The Box

Baton Rouge from the album “Ecstasy” by Lou Reed

Everloving from the album “Play” by Moby