Question : Predicted hard drive failure

We've had a few predicted hard drive failures on a few of our PowerEdge servers. The server will be in a 'predicted failure' state, but not fully failed. We've noticed that the server becomes very slow and sometimes unresponsive around the time we first learn of this. We contacted Dell to see if they have any metric on how many times a read/write has to fail before the drive is considered completely failed. Has anyone come across similar experiences, or does anyone have any metric on this sort of activity?

Thanks.

Answer : Predicted hard drive failure

Having written my first S.M.A.R.T. management product back in 1999 for HP SCSI devices, and having NDAs with the very same manufacturers mentioned in this thread (I have written plug-ins for their OEMs to interpret, diagnose, and repair various aspects of drive health), let me add a little to the mix.

* S.M.A.R.T. is a predictive failure technology.  Per the spec, it is generally architected to give you 24-hour notice of impending doom.  False positives and undetected failures can easily account for over 25% of failure scenarios, and sometimes the numbers are much higher.  In other words, take whatever it tells you under advisement.

* S.M.A.R.T. (I will now call it SMART because I am sick of the D.O.T.S.) algorithms vary by make/model and firmware, and detection/reporting is profoundly different between the ATA and SCSI protocols, so I will talk big picture rather than address specifics of a SATA or SAS drive.   All devices have various measurements for things like RPM variation and head fly height that can change slightly and indicate a degrading condition.  At some point, a disk will determine that enough is enough and set a status bit and error code byte that are returned when the hardware is told to poll itself and report back.   Some software products incorrectly (and this is specifically addressed in the ANSI spec as something NOT to do) look at one or two metrics and make a judgment call that the disk is dying.  The correct thing a developer is supposed to do is wait for the disk to tell you, as in the sketch below.
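
A minimal sketch of "letting the disk tell you," assuming smartmontools is installed; the device path is just an example:

```python
# Minimal sketch: ask the drive for its own SMART verdict via smartctl
# (smartmontools). /dev/sda is an example device path.
import subprocess

def smart_overall_health(device="/dev/sda"):
    # "smartctl -H" reports the drive's self-assessment rather than a
    # guess based on individual attribute values.
    out = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True
    ).stdout
    # ATA drives report "PASSED"; SCSI/SAS drives report "SMART Health Status: OK"
    return "PASSED" in out or "OK" in out

if __name__ == "__main__":
    print("Drive reports healthy:", smart_overall_health())
```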

* Decreased performance due to ECC errors and tracks that are not readable on the first pass is not a SMART error, but drive firmware factors this information into the algorithm that determines whether or not the drive warrants a SMART alert.   One cannot "reset" SMART status; the ANSI specification has no such thing.  Now, if you have a disk that reports it is failing a SMART test because there is an unreadable block that has not been replaced by a reserved (spare) block, then you can clear the condition by remapping the bad block, but this is a bad idea, because neither the disk drive nor software like SpinRite has any idea whether the file system considers that unreadable block to be part of a file.
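
If you want to see the reallocation-related attributes that feed into that decision, here is a hedged sketch (again assuming smartmontools; the attribute names follow smartctl's ATA output and the device path is an example):

```python
# Hedged sketch: list the reallocation/pending-sector attributes from an
# ATA drive's SMART attribute table. /dev/sda is an example device.
import subprocess

ATTRS = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def reallocation_attributes(device="/dev/sda"):
    out = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True
    ).stdout
    # Keep only the attribute lines we care about; the raw value is the
    # last column of each line.
    return [line for line in out.splitlines()
            if any(attr in line for attr in ATTRS)]

if __name__ == "__main__":
    for line in reallocation_attributes():
        print(line)
```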

* If you have some late-model SCSI, SAS, or FC disks, then there is a SCSI command family called background media scanning (some SATA disks have this too, but there it is vendor specific) that allows disk drives to self-clean and repair bad blocks in the background during idle time.   Software such as santools smartmonux can enable this feature and run reports.  Most late-model Seagate SAS/FC/SCSI disks give you the BGMS commands, and Hitachi and others offer this as well. Read the programming manual of the disk drive to see if it is there, and turn it on if you can.
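
A rough sketch of reading a drive's background scan results with sg_logs from sg3_utils; whether the Background Scan Results log page (0x15) is present at all depends on the drive, and the device path is an example:

```python
# Hedged sketch: dump the Background Scan Results log page (0x15) of a
# SCSI/SAS disk via sg_logs (sg3_utils). /dev/sg1 is an example device;
# not every drive implements this page, and enabling BMS itself goes
# through the Background Control mode page per the drive's manual.
import subprocess

def background_scan_results(device="/dev/sg1"):
    result = subprocess.run(
        ["sg_logs", "--page=0x15", device],
        capture_output=True, text=True
    )
    return result.stdout

if __name__ == "__main__":
    print(background_scan_results())
```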

* In RAID environments, there is going to be a data verification, consistency check, media scan, or similar feature that reads all blocks from all drives, corrects parity, and rewrites bad or unreadable blocks.  DO THIS RELIGIOUSLY, ONCE A WEEK.   This will force recovery of bad blocks, which could take 5-10 seconds per RAID stripe if you have low-end hardware.  If you have a NetApp or something more expensive with enterprise drives, then you probably won't see any performance hit while it runs.
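
For Linux software RAID (md), kicking off that scrub is a single sysfs write; a hardware controller such as a PowerEdge PERC exposes the same operation through its own management tools instead. A minimal sketch:

```python
# Minimal sketch: trigger a scrub of a Linux md array by writing "check"
# to its sync_action node. The kernel reads every stripe; unreadable
# blocks are reconstructed from redundancy and rewritten. md0 is an example.
from pathlib import Path

def start_md_check(array="md0"):
    Path(f"/sys/block/{array}/md/sync_action").write_text("check\n")

if __name__ == "__main__":
    start_md_check()   # typically wired into cron to run once a week
```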

* Run the VERIFY CDB to scan for and detect recoverable blocks.  This is what Windows does when you run chkdsk /r; the /r switch is the key, as that is what drives the VERIFY, though Windows only scans the range of blocks the file system uses.  SpinRite does much the same kind of surface scan.
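
A sketch of driving VERIFY over a block range with sg_verify from sg3_utils; the device, starting LBA, and block count are all examples:

```python
# Hedged sketch: issue SCSI VERIFY over a block range with sg_verify
# (sg3_utils). /dev/sg1, the starting LBA, and the count are examples.
import subprocess

def verify_range(device="/dev/sg1", lba=0, count=1024 * 1024):
    result = subprocess.run(
        ["sg_verify", f"--lba={lba}", f"--count={count}", device],
        capture_output=True, text=True
    )
    # A non-zero exit status generally means the drive reported an
    # unrecoverable medium error somewhere in the range.
    return result.returncode == 0

if __name__ == "__main__":
    print("Range verified clean:", verify_range())
```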

* If you get a SMART error and have a non-OEM disk (i.e., a retail drive with a retail warranty), then a SMART error qualifies for a warranty replacement (if within the warranty period), so even if the error is a false alarm, you can get the drive swapped out.   The same is generally true with the computer manufacturers.   But if you buy your disks at Fry's and they are bulk packaged, then forget it: serial numbers and part numbers are different, and the money you save buying the non-retail versions represents the price delta between a disk with a 30-day warranty and one with a 3-5 year warranty.


* RAID controllers are a whole 'nother discussion, and really need to be addressed in the context of specific implementations if you want to get deep into it.  Suffice it to say that SpinRite and HDDRegen should generally never be run on a RAID member unless you already know which blocks the controller considers bad before you start.  Otherwise, when they repair a block, they can very well corrupt data on the stripe.