Saturday, January 26, 2008

Disk monitoring and tuning with dd and S.M.A.R.T. - Reallocating bad sectors and predicting disk failure

What is S.M.A.R.T.?

Modern disk drives automagically reallocate bad sectors on the fly as soon as they encounter some kind of R/W/ECC error. For this to happen, though, the drive must first access the affected sector. This is why you rarely see surface errors on modern disks.

Modern hard drives (ATA and SATA) have S.M.A.R.T. - Self-Monitoring, Analysis, and Reporting Technology. Once you have that enabled in BIOS (assuming you have a S.M.A.R.T. capable disk and controller) you can monitor a number of disk health and performance parameters.

What you should keep an eye on is the Reallocated Sectors Count: if the drive hits an R/W/ECC error on a sector, it marks that sector "Reallocated" and moves the data to a spare area on the disk. Reallocation costs some performance, and a rising count is a sign of impending disk failure.


Monitoring S.M.A.R.T.

ATA and SATA disks:

To monitor S.M.A.R.T. data you can use HDTune on Windows, or SmartMonTools (smartd, smartctl) on Darwin (Mac OS X), Linux, FreeBSD, NetBSD, OpenBSD, Solaris, OS/2, or eComStation systems. If you're up to it, you can also use SmartMonTools on Windows.

USB Enclosures:
While in most cases you should have no trouble using HDTune or SmartMonTools, some USB drive enclosures do not cooperate with S.M.A.R.T. monitoring programs. In such cases, you can download vendor software to perform the monitoring, like "Western Digital Data LifeGuard Diagnostics".

iPods:
You can also get S.M.A.R.T. info on your iPod. You can either configure it to act as a pass-through device (regular USB media) or boot your iPod in diagnostic mode, where you can check S.M.A.R.T. disk data and run further tests. To enter diagnostic mode, reset your iPod and hold REW + Select (5G) at the Apple boot menu. For other iPod models, see here (or Google Apple Diagnostic Mode your iPod Model).

Forcing the disk to remap damaged sectors

If you see problems with any of the Reallocated Sector Count, Reallocated Event Count, Seek Error Rate, Offline Uncorrectable, UDMA CRC Error Count, Multizone Error Rate, or Hardware ECC Recovered values, you should consider getting a new disk. These are all signs of a failing disk. Learn more about S.M.A.R.T. attributes and their meaning here. Note that depending on the vendor, there may also be enhanced or proprietary S.M.A.R.T. attributes. Read your HDD vendor documentation.
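
To pull just those values out of a full attribute listing, something like the following could work. It is a minimal sketch, not part of SmartMonTools: the function name is made up for illustration, and it assumes the usual 10-column "smartctl -A" table, with the attribute name in column 2 and the raw value in column 10, and smartctl's own spelling of the attribute names.

```sh
#!/bin/sh
# Print "name raw_value" for the attributes listed in the text above.
# Reads `smartctl -A` output on stdin.
# Assumption: 10-column table, name in column 2, raw value in column 10.
watched_attributes() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Seek_Error_Rate|Offline_Uncorrectable|UDMA_CRC_Error_Count|Multi_Zone_Error_Rate|Hardware_ECC_Recovered)$/ {
            print $2, $10
        }
    '
}

# Typical use (as root; /dev/hda is the example device used in this post):
#   smartctl -A /dev/hda | watched_attributes
```

Not every drive reports every one of these attributes, so an attribute that is missing from the output is not in itself a bad sign.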

But sometimes you just need to get a bit more life out of a disk, and force the disk to reallocate damaged sectors. You can do so easily by performing a full raw disk read and write operation. For this, you can use the UNIX "dd" tool. Make sure your target disks aren't mounted (type "mount" to list mounted disks, then use "umount disk").

You can perform a disk read operation (reading the whole disk) using a syntax similar to:

# dd if=/dev/disk of=/dev/null bs=2048
You can perform a disk write operation (zero out the disk, this WILL result in data loss) using syntax similar to:
# dd if=/dev/zero of=/dev/disk bs=2048
Now you may wish to perform both a read and write at the same time, and not wipe out your disk data (zero it out). You can perform such a "disk refresh" using syntax similar to:
# dd if=/dev/disk of=/dev/disk bs=1m
This reads the data and rewrites it to disk in 1 MB chunks, to prevent presently recoverable read errors from progressing into unrecoverable read errors. Note that bs=1m is the BSD dd spelling; GNU dd on Linux expects bs=1M.
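
Before pointing the refresh at a real disk, it may be worth rehearsing it on a scratch file. The sketch below is only an illustration under a few assumptions: the function name and scratch path are made up, the block size is given in bytes to sidestep the 1m / 1M spelling difference, and conv=notrunc is added because dd would otherwise truncate a regular file before writing to it (the flag should make no difference on a block device).

```sh
#!/bin/sh
# Read a target and write every block back to the same offset, 1 MiB at a time.
# conv=notrunc: do not truncate the output first (matters for regular files).
refresh_in_place() {
    dd if="$1" of="$1" bs=1048576 conv=notrunc
}

# Dry run on a scratch file (hypothetical path); the checksum should not change:
#   dd if=/dev/urandom of=/tmp/scratch.img bs=1048576 count=4
#   cksum /tmp/scratch.img
#   refresh_in_place /tmp/scratch.img
#   cksum /tmp/scratch.img
```

If the two checksums match, the same invocation with your real device in place of the scratch file does what the refresh described above intends.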

Of course, you should read the dd manpage for your OS (on Windows you could use a dd for Windows implementation or resort to some sort of Linux or BSD LiveCD). Replace /dev/disk with your disk (make sure you're using the right disk). On Linux you can find out what disk you need to use from "dmesg" or /proc/partitions:
# cat /proc/partitions
You can also use "fdisk -l" to list the partitions on your disk and check that it's the right one:
# fdisk -l /dev/hda
Do note that you need root permissions for all of this activity, so on some Linux systems you may need to use "sudo -i" to get a root shell, or precede all operations with "sudo".
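
Since the target disk must not be mounted, a small guard in front of dd can help. This is a rough sketch with a made-up function name; it assumes the mount listing starts each line with the device name, which is the usual "mount" output format on Linux and the BSDs.

```sh
#!/bin/sh
# Exit 0 if any line of a mount listing (on stdin) starts with the given
# device name. A prefix match is used so that partitions of the disk
# (/dev/sdb1, /dev/sdb2, ...) count as "in use" too.
device_in_use() {
    awk -v d="$1" 'index($1, d) == 1 { found = 1 } END { exit !found }'
}

# Typical use (/dev/sdb is a placeholder device name):
#   if mount | device_in_use /dev/sdb; then
#       echo "/dev/sdb is still mounted, unmount it first" >&2
#   fi
```

Because it is a plain prefix match, /dev/sda would also match a hypothetical /dev/sdaa, so treat a positive result as a prompt to look at the "mount" output yourself.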

While you're doing this rewrite operation, you should monitor the kernel log (dmesg). You can monitor /var/log/messages for this:
# tail -f /var/log/messages
You usually watch out for "DriveReady SeekComplete Error status=0x51 DriveStatusError error=0x04" or some other error.
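
If you would rather not stare at the log for the whole run, you can count the matching lines afterwards. A minimal sketch, with a made-up function name, which only looks for the "DriveStatusError" string quoted above; other drivers word their errors differently, so adjust the pattern to what your kernel actually logs.

```sh
#!/bin/sh
# Count the lines in a log file that contain the DriveStatusError marker.
# Assumption: errors are logged with the wording quoted in the text above.
count_drive_errors() {
    grep -c 'DriveStatusError' "$1"
}

# Typical use:
#   count_drive_errors /var/log/messages
# Live filtering while dd runs (--line-buffered is a GNU grep option):
#   tail -f /var/log/messages | grep --line-buffered 'DriveStatusError'
```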

You should also keep an eye on the Reallocated Sectors and other Interesting Parameters in smartctl:
# smartctl -A /dev/hda
Do this every now and then, and note the values before you start the operation.
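
One way to keep track of the before and after values is to save the listings to files and compare them. The following is a sketch only: the function and file names are made up, and it assumes the drive reports Reallocated_Sector_Ct with the raw value in column 10 of the "smartctl -A" table.

```sh
#!/bin/sh
# Raw Reallocated_Sector_Ct value from a saved `smartctl -A` listing.
realloc_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }' "$1"
}

# Number of sectors reallocated between two saved listings (before, after).
# Assumption: both listings contain the attribute.
realloc_delta() {
    echo $(( $(realloc_count "$2") - $(realloc_count "$1") ))
}

# Typical use (hypothetical file names):
#   smartctl -A /dev/hda > before.txt
#   ... run the dd operation ...
#   smartctl -A /dev/hda > after.txt
#   realloc_delta before.txt after.txt
```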

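Once dd is running you can send it a SIGINFO signal (use pkill / kill / whatever) to make it print I/O information (progress). Some shells / terminals also respond to Ctrl-T by sending SIGINFO. SIGINFO is a BSD and Mac OS X signal; GNU dd on Linux prints the same statistics when it receives SIGUSR1 instead (pkill -USR1 dd).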
# pkill -SIGINFO dd
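
The progress signal depends on which dd you are running: BSD dd reports on SIGINFO, while GNU dd on Linux reports on SIGUSR1. A small helper can pick the right one. This is a sketch with a made-up function name; the list of systems is an assumption based on which ones ship a BSD-style dd, and anything not listed is deliberately left unanswered.

```sh
#!/bin/sh
# Print the signal name that makes dd report progress on a given system.
# Argument: the output of `uname -s`.
progress_signal_for() {
    case "$1" in
        Linux)
            echo USR1 ;;   # GNU dd
        Darwin|FreeBSD|NetBSD|OpenBSD|DragonFly)
            echo INFO ;;   # BSD dd
        *)
            return 1 ;;    # unknown system: check the dd manpage
    esac
}

# Typical use:
#   sig=$(progress_signal_for "$(uname -s)") && pkill -"$sig" -x dd
```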

Once you're done with dd and S.M.A.R.T. tools you should also perform a filesystem check (fsck / chkdsk / whatever).

Conclusions:
  1. Monitor S.M.A.R.T. data with smartctl and keep an eye on Reallocs. Consider getting a new disk if you see reallocated sectors.
  2. Perform a disk refresh with dd in order to prevent recoverable read errors from progressing into unrecoverable errors. You don't need fancy tools like SpinRite.
  3. You can use a simple Linux or BSD LiveCD to perform the disk refresh.
  4. This is NOT a data recovery procedure. If you're doing data recovery, use something like dd_rescue to copy to separate media.
  5. This is NOT a step by step tutorial. Read your OS manpages to make sure you're not wiping out the wrong disk or something.
  6. Always monitor S.M.A.R.T. parameters in order to spot disk failure before it happens.
  7. Always keep backups.

2 comments:

dali said...

Impressive knowledge!! I was wondering if you could explain something in a bit easier terms. I am trying to understand the dd command or disk utility (if it exists) to recover files from an external hd that does not want to be recognized...
if this is not the right way to ask... please forgive my ignorance in the matter. thanks

cmihai said...

dd (and its variants) basically does a raw copy of a target (if) to a destination (of). Variants like dd_rescue attempt to skip blocks that cannot be read, or attempt to re-read them, etc. (with plain dd you can also use conv=noerror to continue copying in case of errors).

If you can't mount a file system or there's an issue with the partition table, etc., you can still make a raw disk image (dd if=/dev/disk of=/whatever bs=2048 conv=noerror) or something along those lines.
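
As a rough sketch of that imaging step (the function name is made up, and source and destination are placeholders), conv=noerror can be combined with sync, which pads any block that could not be read in full with zeros so that everything after it stays at the right offset in the image:

```sh
#!/bin/sh
# Copy a disk (or any file) to an image, carrying on past read errors.
# noerror: keep going after a read error
# sync:    pad short or failed blocks with zeros up to the block size
image_disk() {
    dd if="$1" of="$2" bs=2048 conv=noerror,sync
}

# Typical use (placeholders; write the image to a DIFFERENT disk):
#   image_disk /dev/disk /whatever
```

Because of the padding, the image may come out slightly larger than the source when the source size is not a multiple of the block size.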

The basic idea is to copy as much data as possible to a disk image, then attempt recovery on that. Tools like dd_rescue will attempt to copy a bad sector multiple times, and disks do automagically remap bad sectors on the fly (if possible). So data recovery is very much possible.


But if your disk isn't recognized by the system at all, there's very little you can do (without messing with the hardware).