So much fun it should be illegal.

Fri 18 July 2008 by jillian

So, yesterday some of you may have noticed that sessrumnir.rootaction.net was down. Today you might notice that it's back up. I bet you are wondering what happened. Or not. That's ok, I'm going to give you the short version anyway. :)

The system experienced multiple concurrent physical failures. Specifically, the connector on one of the disk drives crumbled (not the cable, the SATA II connector block with the "pins") and the mainboard failed. It is possible, but unlikely, that the two are related. Read on, if you dare...

The system appeared to have spontaneously powered off, and would not power back on. Suspecting the power supply, I reached for my trusty voltmeter and, using pinouts and procedures from elsewhere, determined that the power supply was likely still good. The next thing to do was unplug everything and start plugging cables back in, one at a time, until the system wouldn't power up anymore. During this part of the process, the SATA connector block (not the cable, the block) on the back of one of the RAID drives (a Maxtor 7H500F0) crumbled as I was carefully unplugging it. I reasoned that this might have been the culprit, so I set the disk aside and continued plugging things in. The system fans kept coming on (which they hadn't been doing before I unplugged everything), so I was rather encouraged...

...until I plugged in the final mainboard power connector, which is responsible for powering up the CPU, whereupon the fans stopped powering up. Not long after this, my friend K. showed up with a new disk in hand and we faced the increasingly appealing prospect of dismantling the machine to extract the CPU and MSI K9AGM2 motherboard for return to Fry's.

So, K. volunteered to drag the hardware back to the returns desk so that I could give the kids some attention before bed time. We now have a new MSI K9-A2GM board, courtesy of Fry's protection service contract. We also have a new disk, courtesy of K., to keep things running while Maxtor replaces the failed one under warranty. I may try taking the failed disk back to Fry's now that I know the data is safe. Even though it was obviously an opened-box item when I bought it, I expect them to try to tell me they won't cover physical failure. This may be a good time to point out to the reader that I've owned a number of MSI boards and Maxtor drives over the years and been quite pleased with them; they usually perform well for 5 or more years.

The process of recovering the RAID was rather painless. Luckily, the disk that failed was not the one that had the boot sector already set up. So all we had to do (really) was:

  1. Tell the initramfs that it was ok to start in degraded mode (mdadm -R /dev/md[0-5]; exit) then wait 2 hours for filesystem and quota checks to finish.
  2. Partition the new disk.
  3. Add the new partitions to the RAIDs, using: mdadm /dev/mdX -a /dev/sdbY for each RAID device/partition pair.
  4. Watch /proc/mdstat to see that it was actually doing something.
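
For anyone who wants to see steps 2 through 4 spelled out, here is a rough dry-run sketch. It only prints the commands rather than running them. The device names are assumptions and not something the steps above state: /dev/sda as the surviving disk, /dev/sdb as the replacement, and md0 through md5 living on partitions 1 through 6. Copying the partition table with sfdisk is one common way to do step 2, not necessarily how it was done here. The helper names recovery_commands and rebuild_progress are made up for this example.

```shell
#!/bin/sh
# Dry run: prints the recovery commands instead of executing them.
# ASSUMPTIONS (illustrative only): /dev/sda is the surviving disk, /dev/sdb
# is the replacement, and md0..md5 sit on partitions 1..6 of each disk.
good_disk=/dev/sda
new_disk=/dev/sdb

recovery_commands() {
    # Step 2: copy the partition layout from the surviving disk.
    echo "sfdisk -d $good_disk | sfdisk $new_disk"
    # Step 3: add each new partition to its array.
    n=0
    while [ "$n" -le 5 ]; do
        echo "mdadm /dev/md$n -a $new_disk$((n + 1))"
        n=$((n + 1))
    done
}

rebuild_progress() {
    # Step 4: print progress such as "recovery = 12.6%" from an
    # mdstat-format file (defaults to /proc/mdstat).
    grep -o 'recovery *= *[0-9.]*%' "${1:-/proc/mdstat}"
}

recovery_commands
```

Check the real array-to-partition mapping in /proc/mdstat before running anything like this for real; adding the wrong partition to an array is not fun.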

3 hours later, the RAID was rebuilt (not too bad for 500GB). The harder part was getting the on-board RTL8168c/8111c ethernet to work (it had worked with the K9AGM2), and it turns out this was inadvertently made more difficult by some auto-configuration subsystem called "udev" which tries to make sure that the same devices always have the same names, even when they are detected in different order:

eth0: RTL8168c/8111c at 0xffffc20001040000, 00:1d:92:b5:79:6a, XID 3c2000c0 IRQ 509
...
udev: renamed network interface eth0 to eth2
r8169: eth2: link up
r8169: eth2: link down
ADDRCONF(NETDEV_UP): eth2: link is not ready

In this case, udev was renaming the on-board LAN (eth0) to eth2 because it was reserving the name eth0 for a different PCI ID and MAC address from the previous mainboard. Even though tcpdump -ni eth2 would monitor traffic on the LAN, the device reported the link was down, and the kernel wouldn't listen (non-promiscuously) or route traffic with it.
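
A quick way to spot this kind of rename is to filter the kernel log for udev's message. The sketch below assumes the message wording matches the log excerpt above; the function name list_udev_renames is made up for this example.

```shell
#!/bin/sh
# Prints "old -> new" for every interface rename that udev logged.
# Reads kernel log text on stdin, e.g.:  dmesg | list_udev_renames
list_udev_renames() {
    sed -n 's/.*renamed network interface \([^ ]*\) to \([^ ]*\).*/\1 -> \2/p'
}

# Demo using the rename line from the log excerpt.
echo 'udev: renamed network interface eth0 to eth2' | list_udev_renames
# prints: eth0 -> eth2
```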

To fix this, since the old LAN port was never coming back, I edited udev's persistent network rules (/etc/udev/rules.d/70-persistent-net.rules on Ubuntu 8.04) and removed the entry for eth0. I then renamed the auto-generated entry for the new board's LAN port from eth2 to eth0. It seems that for some reason the r8169 driver doesn't handle renaming after initialization, and won't detect that the link is ready if udev renames it. At least, that's what "seems" to be going on in Ubuntu 8.04 LTS with linux-2.6.24-19 and prior kernels.
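
Here is a minimal sketch of that edit done with sed instead of by hand. It works on a scratch file and writes the result to stdout, so nothing under /etc is touched. The rule syntax in the sample is illustrative and may not match what your udev version generates, the old board's MAC address (00:00:00:00:00:01) is invented, and fix_net_rules is a made-up name. As far as I know, Ubuntu releases of this era keep these rules in /etc/udev/rules.d/70-persistent-net.rules.

```shell
#!/bin/sh
# Sketch: drop the stale eth0 rule, then rename the eth2 rule to eth0.
# Writes the result to stdout; it never edits the file in place.
fix_net_rules() {
    sed -e '/NAME="eth0"/d' -e 's/NAME="eth2"/NAME="eth0"/' "$1"
}

# Demo on a scratch copy. The rule syntax and the old MAC are made up for
# illustration; the new MAC is the one from the kernel log.
rules=$(mktemp)
cat > "$rules" <<'EOF'
SUBSYSTEM=="net", ATTRS{address}=="00:00:00:00:00:01", NAME="eth0"
SUBSYSTEM=="net", ATTRS{address}=="00:1d:92:b5:79:6a", NAME="eth2"
EOF
fix_net_rules "$rules"
rm -f "$rules"
```

The order of the two sed expressions matters: the stale eth0 line is deleted first, so the freshly renamed line is not removed along with it.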