'tis the season for disk failures

Wow — two disk failures in totally disparate systems in a single day.

The first is the [Linux] server at my church. It’s a Dell PowerEdge server with 3 disks running in a RAID5 configuration (I have Kim B. to thank for convincing me: “what the heck; we might as well run RAID5, right?” Woot!). Earlier this week, the server stopped responding entirely. The on-site staff rebooted it and it all came back fine, but nothing showed up in the logs as to why it had failed. So we scratched our heads and went on with life. The next day, it “half-died”, meaning that Samba kept working fine (which is the main purpose of the server), but incoming ssh connections would hang halfway through authentication.

The on-site staff connected a monitor to the machine and saw that there were SCSI errors on the console (but not in the logs!). I got in early the next morning and found that one of the three disks was issuing media errors, so I forced it offline. RAID5 took over without a beat and no data was lost. Woot!

The machine is still under warranty, so Dell overnighted a new disk that should be there today. The hardware RAID is hot-swappable; I’m told that I can just plug in the new disk and it will automatically start rebuilding the RAID5.


One of the two brand-new SCSI disks in WOPR (the server that hosts squyres.com and several other friends’ domains) also died. It’s running software RAID (RAID0+1, IIRC?); the disk had been issuing warnings for a few days and finally failed completely last night. This caused massive badness in the machine (unresponsiveness, inability to hard reboot, etc.). Bmoore spent a good amount of time on the phone with Jason (the on-site tech); they managed to coax it back to life by somehow convincing the software RAID that the disk wasn’t there (the RAID was failing in odd ways when it thought that the disk was there, bringing the entire machine down).

I bought these two new disks in early December from a low-cost supplier (no names mentioned). Luckily, they have a 365-day warranty. The failure supposedly occurred during their business hours, but I couldn’t get anyone on the phone. According to their web pages, once I’ve filled in a web form for warranty service, I can expect an RMA number within 2 days (!). Apparently (it’s not 100% clear from their web pages), we have to ship the disk back to them and then they’ll ship us a new one.

So I’m thinking that it’ll be at least a week before we get a new disk — squyres.com will be running without RAID backup for the entire time.

Just contrasting this with my experience from Dell, I’m probably never going to buy from these low-cost vendors again. The immediate/no-hassle/no-fuss service from Dell was worth the extra cost.


Just a quick clarification: WOPR is (was) running straight-up RAID1. Currently, all partitions but root are running on RAID in degraded mode, but the root partition, for some unknown reason, refuses to start up the RAID, so it is running bare (w/o RAID).

Also, Jason isn’t just the on-site tech, he’s the CEO of Datility Networks http://www.datility.net/ where the WOPR is hosted. I spent over an hour on the phone with him working on WOPR last night. And they called ME when they received an alert that WOPR had stopped responding. I have to give mad props to these guys, and whole-heartedly recommend them to anybody looking for hosting / co-location services.

