Storage Soup - A SearchStorage.com blog


A data storage blog offering commentary on the storage industry, as well as a behind-the-scenes look at developments in storage management, SAN, NAS, backup, disaster recovery and storage strategy.

SATA takes on a life of its own

Companies tend to focus on the positive aspects of using SATA disk drives for a growing portion of their enterprise storage needs, but as some are finding out, managing thousands or tens of thousands of SATA disk drives can take on a life of its own.

Recently, I spoke to Lawrence Livermore National Laboratory (LLNL), which is a huge DataDirect Networks user. By huge, I mean they use multiple DataDirect Networks storage systems, with the total number of SATA disk drives in production numbering in the tens of thousands, possibly even up to a hundred thousand. More impressive, LLNL uses these storage systems in conjunction with some of the world’s fastest supercomputers, including BlueGene/L, currently rated #1 among the world’s fastest computers.

The issue that crops up when companies own tens of thousands of disk drives, SATA or FC, is the growing task of managing failed disk drives. Companies such as Nexsan Technologies report failure rates of less than half of 1% of all SATA disk drives that they have deployed in the field. Those numbers sound impressive until one encounters environments like LLNL that may have up to a hundred thousand SATA disk drives. Using a 0.5% failure rate (half of 1%) in that scenario, companies can statistically expect a SATA disk drive to fail about every other day, which is in line with LLNL’s experience.

This is in no way intended to reflect negatively on DataDirect Networks. If users were to deploy similar numbers of disk drives from any other SATA storage system provider, be it Excel Meridian, Nexsan Technologies or Winchester Systems, they could expect similar SATA disk drive failure rates.

The cautionary note for users here is twofold. First, be sure your disk management practices keep up with your growth in disk drives. Replacing a disk drive may not sound like a big deal, but consider what is involved with a disk drive replacement:

  • Discovering the disk drive failure
  • Contacting and scheduling time for the vendor to replace the disk drive
  • Monitoring the rebuild of the spare disk drive
  • Determining if there is application impact during the disk drive rebuild
  • Physically changing out the disk drive

Assuming a 0.5% failure rate, companies with hundreds of disk drives will repeat this process about once a year, those with thousands about once a quarter and those with tens of thousands about once a week. Once a company crosses the 10,000-drive threshold, it needs to seriously contemplate dedicating a person at least part-time just to monitor and manage disk drive replacements, regardless of which vendor’s storage system it selects.
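The arithmetic behind those intervals can be sketched in a few lines of Python. This is a rough illustration only: it assumes the failure rate is half of 1% per year (0.005 as a fraction), that failures are spread evenly across the year, and that the fleet sizes are round numbers picked for the example. The function name is hypothetical.

```python
def days_between_failures(drive_count, annual_failure_rate=0.005):
    """Average number of days between drive failures across a fleet.

    annual_failure_rate is a fraction (0.005 means 0.5% per year). It is
    an assumption; substitute the rate your own vendor reports.
    """
    expected_failures_per_year = drive_count * annual_failure_rate
    return 365 / expected_failures_per_year


# Illustrative fleet sizes, not measurements from any real site.
for drives in (200, 1_000, 10_000):
    interval = days_between_failures(drives)
    print(f"{drives:>6,} drives: one failure about every {interval:.1f} days")
```

Under those assumptions, 200 drives works out to roughly one replacement a year, 1,000 drives to one every ten weeks or so, and 10,000 drives to about one a week.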

The other cautionary note is that the more disk drives one deploys, the more likely it becomes that two or even three disk drives in the same RAID group will fail before a recovery of an existing failed disk drive is complete. Companies, now more than ever, need to ensure they are using RAID-6 for their SATA disk drive array groups and, when crossing the 10,000 disk drive threshold, should consider the new generation of SATA storage systems from companies such as DataDirect Networks and NEC. These systems give companies more data protection and recovery options for their SATA disk drives.
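To get a feel for how RAID group size and rebuild time drive that risk, here is a minimal sketch. It assumes drives fail independently at a constant rate equivalent to 0.5% per year, which likely understates the risk when drives from the same batch age together. The function name, group sizes and rebuild times are hypothetical values chosen for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365


def second_failure_probability(group_size, rebuild_hours,
                               annual_failure_rate=0.005):
    """Chance that at least one surviving drive in a RAID group fails
    while a failed drive is still being rebuilt.

    Assumes independent failures at a constant rate (an assumption,
    not a property of any particular vendor's drives).
    """
    survivors = group_size - 1
    # Convert the annual failure fraction into a constant hourly hazard.
    hazard_per_hour = -math.log(1 - annual_failure_rate) / HOURS_PER_YEAR
    # Probability that a single surviving drive lasts through the rebuild.
    p_one_survives = math.exp(-hazard_per_hour * rebuild_hours)
    return 1 - p_one_survives ** survivors


# Illustrative group sizes and rebuild times only.
for group_size in (6, 12):
    for rebuild_hours in (12, 48):
        p = second_failure_probability(group_size, rebuild_hours)
        print(f"{group_size:>2} drives, {rebuild_hours:>2}h rebuild: {p:.6%}")
```

The figure for any single rebuild is small, but it grows with both the number of drives in the group and the length of the rebuild, and a large site goes through many rebuilds a year. That is the exposure RAID-6 is meant to cover.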

3 Comments

  1. I agree with Jerome’s cautionary notes. While the number of extremely large deployments of SATA is relatively small, medium deployments are just as vulnerable to loss as disk sizes increase. Medium deployments of SATA or FC should consider the mean time between failures for their media and their early failure detection criteria, and should especially note the rebuild time for large disks. The new 750GB and 1TB SATA drives take a long time to rebuild. If you configure RAID sets with a large number of drives (> 5) per RAID group, the probability of dual drive failures increases due to the long rebuild time for these drives. The mean time to failure is still small, but if all of the disks are new and started their duty cycles at the same time, the peak failures should be predictable and extra caution should be exercised.

    Comment by Barry Ribbeck — August 20, 2007 @ 8:40 am

  2. Doesn’t LLNL require shredding of drives? Even though a drive goes bad, it could still have recoverable data on it. In a secure site like LLNL, procedures should call for proper disposal of replaced drives, adding even more cost to a replacement.

    Comment by Tim Laswell — August 20, 2007 @ 10:05 am

  3. .005% disk failure, that’s not much…

    Jerome takes some time out with Lawrence Livermore National Laboratory (LLNL) in Livermore, CA. 0.005% is a small number if your total is 100, even 1000. 0.005% of 10000 is a different situation altogether. The increase of SATA drive…

    Trackback by DCIG, Inc — September 26, 2007 @ 1:16 am
