Question : Windows Server 2003 Intermittent Freezing Issues

We have a client running Windows Server 2003 that has lately been freezing up occasionally and becoming nearly unresponsive.  We have gotten a few intermittent read errors in the Event Viewer (see below), and the Acronis backup occasionally fails with read errors on random files.  We have run chkdsk on the server and checked the status of the drives in the RAID array, and everything looks good.

Here are a couple of errors that we recently got on the server:

Log: Application
Type: Error
Event: 439
Agent Time: 2010-05-21 10:38:54Z
Event Time: 02:36:13 PM 21-May-2010 UTC
Source: ESENT
Category: Logging/Recovery
Username: N/A
Computer: THE-ORACLE
Description: tcpsvcs (2504) Unable to write a shadowed header for file C:\WINDOWS\System32\dhcp\j50.chk. Error -1032.

Log: Application
Type: Error
Event: 490
Agent Time: 2010-05-10 08:43:28Z
Event Time: 12:40:39 PM 10-May-2010 UTC
Source: ESENT
Category: General
Username: N/A
Computer: THE-ORACLE
Description: Catalog Database (976) Catalog Database: An attempt to open the file "C:\WINDOWS\system32\CatRoot2\edb.chk" for read / write access failed with system error 32 (0x00000020): "The process cannot access the file because it is being used by another process. ".  The open file operation will fail with error -1032 (0xfffffbf8).

Answer : Windows Server 2003 Intermittent Freezing Issues

That confirms my suspicions ...

Your $50 consumer-class disk drives will simply not do error recovery correctly on this controller.  It is a TLER (time-limited error recovery) issue.

When a drive attempts to read a block that gives either an ECC error or a read error, or one that just does not read immediately, it goes into a deep recovery phase to try to get the data.  If it recovers, it moves on; otherwise it locks up all I/O until the firmware-specified timeout, which is 10+ seconds depending on the firmware/model.  The problem is that most controllers only allow 7-8 seconds for recovery.  If a drive takes longer than that, bad things happen, like drives going offline and data getting lost.
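
The timeout mismatch described above can be illustrated with a minimal Python sketch. The function name and the 3-second figure for an enterprise drive are assumptions made for illustration; the 10-second and 7-second figures are the ones quoted above.

```python
def drive_dropped(recovery_seconds: float, controller_timeout_seconds: float) -> bool:
    """Return True if the controller gives up before the drive finishes recovery."""
    return recovery_seconds > controller_timeout_seconds


# Consumer drive: deep recovery of 10+ seconds vs. a controller that waits about 7-8 seconds.
consumer_dropped = drive_dropped(recovery_seconds=10.0, controller_timeout_seconds=7.0)

# Enterprise drive: gives up after "just a few seconds" (3.0 is an assumed value).
enterprise_dropped = drive_dropped(recovery_seconds=3.0, controller_timeout_seconds=7.0)

print(consumer_dropped, enterprise_dropped)
```

The point is only that the drive's recovery time has to stay below the controller's patience, whatever the exact numbers are for a given model.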

You need to run enterprise-class disks, which are programmed to give up after just a few seconds.  Not only will this minimize the timeouts, but you may never even see timeouts, as these drives also typically have 2 more ECC bits.  Heck, the number of data bits on the array is pretty much the same as the number of bits the ECC can be expected to recover without a failure, so statistically, if you read every bit on the entire RAID twice, you should expect to lose somewhere between 512 bytes and 64 KB.
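
To put rough numbers on that statistical argument, here is a hedged sketch. Both inputs are assumptions, not measurements from this server: a hypothetical 6 TB array, and a rate of one unrecoverable read error per 10^14 bits, a figure commonly quoted on consumer drive spec sheets. Plug in the real capacity and the rate from the drive's data sheet.

```python
import math


def expected_read_errors(capacity_bytes: float, passes: int, bits_per_error: float) -> float:
    """Expected number of unrecoverable read errors when reading the array `passes` times."""
    bits_read = capacity_bytes * 8 * passes
    return bits_read / bits_per_error


def probability_of_at_least_one(expected_errors: float) -> float:
    """Poisson approximation: chance of seeing one or more errors."""
    return 1.0 - math.exp(-expected_errors)


# Assumed values, not taken from this server:
ARRAY_BYTES = 6e12      # hypothetical 6 TB array
BITS_PER_ERROR = 1e14   # assumed spec: 1 unrecoverable error per 10^14 bits read

expected = expected_read_errors(ARRAY_BYTES, passes=2, bits_per_error=BITS_PER_ERROR)
print(f"expected errors: {expected:.2f}")
print(f"P(at least one): {probability_of_at_least_one(expected):.0%}")
```

Each such error costs at least one sector, and how much more is lost depends on how the controller handles the failed read.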

Also, Intel does NOT certify, qualify, or recommend these drives for use with this RAID (or any of its Matrix controllers for server use).
Seagate does not design those drives for 24x7 operation.  Those disks are designed for light duty, 2400 hours of use per year.  Do the math on how many days that is.
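
Doing that math takes only a few lines. The sketch assumes a 365-day year and uses the 2400 hours/year figure quoted above.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * HOURS_PER_DAY  # 8760 hours in a non-leap year

rated_hours_per_year = 2400  # light-duty figure quoted above

days_of_continuous_use = rated_hours_per_year / HOURS_PER_DAY
duty_cycle = rated_hours_per_year / HOURS_PER_YEAR

print(f"{days_of_continuous_use:.0f} days of 24x7 operation per year")
print(f"{duty_cycle:.1%} duty cycle")
```

That is 100 days of round-the-clock running, a bit over a quarter of what a server that never shuts down asks of its disks.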

Your solution is to get enterprise class drives.

You might want to read this ..
http://www.experts-exchange.com/articles/Storage/Misc/Disk-drive-reliability-overview.html

This goes into it in more detail (a WD paper, but TLER is the same issue with Seagate disks)
http://www.wdc.com/en/library/sata/2579-001098.pdf

Now, Intel does have a firmware/driver update that may help, but it is not a cure; it will only prevent other types of issues.  See the link below.
http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&ProdId=1657&DwnldID=8849&lang=eng

My formal recommendation is for you to get enterprise class drives that will not have this issue.
