Saturday, March 03, 2007

Hard Drive Reliability

I love the Security Now! podcast - episode eighty-one this week was a corker. Leo and Steve discuss the distressing results and implications of two recent very large population studies (more than 100,000 drives each) of hard drive field failures. Google and Carnegie Mellon University (CMU) both conducted and submitted studies for the recent 5th USENIX conference on File and Storage Technologies.
Since I have a modest amount of experience with hard drive reliability I thought I'd drop Steve the following:
For the last twenty years I've been working in broadcast engineering and the track of my career has mirrored the uptake of commodity computers over bespoke television equipment.
I had a couple of points - one interesting and one informative - that I thought you might enjoy:

  • In the mid-nineties the whole industry was switching over from editing video-tape to cutting shows on workstations. Consequently large and fast hard-drives were needed and Micropolis (now out of business) was one of the manufacturers of choice for most TV facilities. They'd launched a nine gig model (a big full-height 5 1/4" device!) that the company I was working for was buying in quantity - even at $2,500 a pop they were thought to be good value! After about six months a number of these drives started to fail. The manufacturer had given us a SCSI utility to see which sectors had failed and it appeared to be the same ones on each drive. In a couple of cases we did a low-level format of the drives (which mapped out the bad sectors) and continued to use them. Those drives then showed further problems and, when analysed, their sectors were clearly failing in very similar patterns. In the end the representative from Micropolis told us that in that series of drives the lubricant would leak from the spindle-bearings and spread out across the platters, creeping progressively further from the centre of the disk.

  • The company I currently work for specialises in editing systems, particularly for film and high-definition television. In the case of HD the data rate off the videotape is either 1.48 or 3 gigabits per second (unlike the domestic HDV format that manages to compress the video to a paltry 18 megabits per second!). In the case of film (either from a digital film camera or telecine film scanner) the data rate can be much higher. The upshot of all this is that the stand-alone storage systems and SANs (storage area networks) have to stripe many drives together to achieve the required through-put. We are very used to having ten or more drives with data striped across them, and the dirty little secret we shy away from is that the mean time between failures of a ten-way drive set is only one tenth that of a single drive, because losing any one drive loses the whole set. Consequently our tech-support department is always trying to resurrect dead fibre-channel drives. We tell customers to keep backups (or even try to bully them into it) but with many terabytes of data it is a hard thing to enforce. This is why mixed striped/mirrored drive sets are becoming popular. Anyhow - one of the things I have found to be useful in temporarily reviving a dead drive (and I've done it maybe a dozen times) is to freeze it. It might sound crazy but if you consider that the most common reason for mechanical failure in a drive is the bearings becoming loose and the drive spinning eccentrically you can see the reason. The cold tightens everything up as it shrinks and (temporarily) allows the thing to work at specification. The only option then is to clone the drive and throw the suspect one away.
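
For anyone who wants to see the "one tenth" arithmetic spelled out, here is a rough sketch in Python. It assumes the drives fail independently and at a constant rate (the usual textbook simplification, not something I've measured), and the 500,000-hour drive MTBF is a made-up figure purely for illustration. The function names are my own.

```python
import math

HOURS_PER_YEAR = 24 * 365


def stripe_mtbf(drive_mtbf_hours: float, n_drives: int) -> float:
    """MTBF of an n-way stripe with no redundancy.

    Assumes independent drives with constant failure rates, so the
    failure rates simply add up: rate_set = n * rate_drive.
    """
    return drive_mtbf_hours / n_drives


def survival_probability(mtbf_hours: float, hours: float) -> float:
    """Chance of running `hours` with no failure (exponential model)."""
    return math.exp(-hours / mtbf_hours)


# Hypothetical single-drive MTBF, for illustration only.
drive_mtbf = 500_000.0
set_mtbf = stripe_mtbf(drive_mtbf, 10)

print(f"Single drive MTBF: {drive_mtbf:,.0f} hours")
print(f"Ten-way stripe MTBF: {set_mtbf:,.0f} hours")
print(f"Single drive survives a year: "
      f"{survival_probability(drive_mtbf, HOURS_PER_YEAR):.1%}")
print(f"Ten-way stripe survives a year: "
      f"{survival_probability(set_mtbf, HOURS_PER_YEAR):.1%}")
```

The point is only that the stripe inherits every drive's failure rate at once; a striped/mirrored set changes the sum because a single dead drive no longer takes the data with it.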
