Updated on 03/08/2021

Due, in part, to the differing views and opinions on the use of hot spare disks expressed in response to our previous post, we've decided to add an update for clarification.

The Problematic Aspects of Using a Hot Spare Disk

In theory, using a hot spare disk with ZFS, Solaris FMA, or any other data storage environment is a good solution: it reacts automatically to a damaged disk in a Redundant Array of Independent Disks (RAID) and does indeed help to minimize the time the array spends in a degraded state.
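For context, attaching a hot spare to a ZFS pool is a one-line operation; in the sketch below the pool name tank and the device path are placeholders only:

  # Add a disk to the pool as a hot spare
  zpool add tank spare /dev/disk/by-id/ata-SPARE_DISK

  # The disk should now be listed under the "spares" section of the pool
  zpool status tank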

That being said, the goal of creating a RAID is to keep operating and avoid losing data in the event of a disk failure, so anything that increases the risk of data loss could be a bad idea. Let's have a look at some of these problematic aspects of hot spare disks.

Hot Spare Disks Add Stress to Vulnerable Systems

The main problem with hot spare disks is that they trigger the rebuild (resilvering) of an array on a system that is still actively being used as a production server. This means that, while the resilvering process is taking place, the system will also still be occupied with the usual production data reads and writes.

Resilvering is a resource-intensive process, so when it is executed while the server is still in use, it has to compete with the production load. Because it runs as a low-priority task, the entire resilvering process can take a very long time (even up to a few weeks). This results in the server working at its maximum achievable throughput for weeks, which can have dire consequences for the disks (especially HDDs).
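As an illustration, the progress of a resilver and the extra I/O load it generates can be watched with the standard ZFS tools; the pool name tank below is a placeholder:

  # Show pool health, resilver progress and the estimated time to completion
  zpool status -v tank

  # Watch per-device I/O every 5 seconds while the resilver competes with production traffic
  zpool iostat -v tank 5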

Having decades' worth of experience, we've seen that the use of hot spare disks in complex enterprise systems increases the probability of additional disks failing, as the resilvering process puts more and more stress on the remaining disks and on the system itself.

Problems in Overall Hot Spare Disk Design

The next flaw of a hot spare disk is that it degrades over time. From the moment it is connected to the system, it keeps on working. And when it is eventually needed as a replacement for a damaged disk, the hot spare itself may simply not be in good enough shape to actually take over.
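Because an idle spare is not exercised by normal pool traffic or scrubs, its health has to be checked separately. A minimal sketch using smartmontools, with a placeholder device path:

  # Query the SMART health status and error counters of the idle spare
  smartctl -a /dev/disk/by-id/ata-SPARE_DISK

  # Optionally run a short SMART self-test on it from time to time
  smartctl -t short /dev/disk/by-id/ata-SPARE_DISK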

Another problematic aspect of hot spare disks is that they are activated automatically as soon as a disk failure is detected, while the corrupted disk might still be connected to the system. That disk may keep trying to reconnect and resume working while the hot spare is taking over its role, adding even more stress to the system. This is yet another factor that can affect the system's overall performance and could potentially lead to data loss.
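One way to keep such a flapping disk from repeatedly rejoining the pool is to take it offline explicitly. A minimal sketch, where the pool name tank and the device path are placeholders:

  # Mark the failing disk as offline so it can no longer reconnect on its own
  zpool offline tank /dev/disk/by-id/ata-FAILED_DISK

  # Verify the resulting pool state
  zpool status -v tank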

Hot Spare Disks Create a Single Point of Failure

If you’re looking to create a system with no single point of failure, a hot spare disk will not provide you with much confidence given that the process of automatically replacing a failed disk has been known to occasionally fail, either partially or fully, and result in data loss. 

Having spent decades providing customers with data storage solutions, we've heard of plenty of cases where a hot spare disk was the cause of an entire server failure and even of data loss. Automation is risky here because it can set off a domino effect, especially when the data storage infrastructure has been running for years and the hardware is worn out.

Our Solution

These problematic aspects of hot spare disks are why we advise not relying on hot spare disks in complex data storage architectures and using other business continuity solutions instead, such as High Availability (HA) clusters, backups, and On- & Off-site Data Protection (ideally all of the aforementioned).

With the ZFS file system it is much easier to monitor the system and create a proper backup; with that in place, you have the ability to retrieve data from a damaged disk and write it onto a new one. In addition, when using an HA cluster, there is the option of manually switching production from the affected node to the second one so that you can perform maintenance on the affected node.
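A minimal sketch of such a ZFS-level backup, assuming a dataset named tank/data and a backup host named backuphost (both names are placeholders):

  # Create a consistent point-in-time snapshot of the production dataset
  zfs snapshot tank/data@before-rebuild

  # Replicate the snapshot to a second machine over SSH (receive it unmounted)
  zfs send tank/data@before-rebuild | ssh backuphost zfs receive -u backup/data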

We’d advise following this procedure once the array shows that a degraded state has occurred as a result of a disk failure:

  1. Move resources to the second node in your HA cluster if possible.
  2. Run a full data backup.
  3. Verify the backed-up data for consistency, and verify whether the data restore mechanism works.
  4. Identify the source of the problem, i.e., find the faulty hard disk. If possible, shut down the server and make sure the serial number of the hard disk matches the one reported by the event viewer or system logs.
  5. Replace the hard disk identified as bad with a new, unused one. If the replacement hard disk has already been used within another RAID array, make sure that any residual RAID metadata on it has been deleted via the original RAID controller.
  6. Start a rebuild of the system.

So, when using this approach, the rebuild consists of 6 steps! With a hot spare disk, your RAID will skip the first four significant steps and automatically run steps 5 and 6. The rebuild will thus be completed before you can carry out the other critical steps; steps that could be the difference between your data being safe and being lost.
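For illustration, on a ZFS pool the manual part of this procedure (roughly steps 4 to 6) could look like the sketch below; the pool name tank and the device paths are placeholders, and the exact identifiers must be verified against your own system before anything is replaced:

  # Step 4: confirm which pool is unhealthy and which disk has failed
  zpool status -x
  zpool status -v tank

  # Step 5: if the replacement disk previously carried a ZFS pool, clear its old labels first
  zpool labelclear -f /dev/disk/by-id/ata-NEW_DISK

  # Step 6: replace the failed disk, which starts the rebuild (resilver), then monitor it
  zpool replace tank /dev/disk/by-id/ata-FAILED_DISK /dev/disk/by-id/ata-NEW_DISK
  zpool status -v tank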

Anyway, it is still completely up to you how to build a proper system. However, we'd suggest not relying on hot spare disks in a ZFS RAID array due to the potential data loss they can cause.

 

49 Comments

  • Martino87r / 03, 05 2012 09:54:50

    From personal experience, working several years on enterprise storage like EqualLogic, Dell MD3000, EMC and others… I cannot agree with not having a hot spare!
    Normally, if you run integrity checks regularly along with patrol reads all the time, the probability of broken bytes on the disks is not that big.
    Not having a hot spare on a RAID system composed of 15/20 disks means that potentially you're running a RAID 0 once a drive has failed, and the more disks you have, the higher the probability of another failing shortly, especially today when people build arrays from disks of the same production batch (likely to have the same MTBF).
    In some arrays I even configured 2 hot spare drives, as the arrays were located far away and the replacement time for a drive was quite long (at least a day).
    I also don't agree with backing up an array which is in a degraded state, as this stresses the drives in the same manner (and even more, due to the random data positioning on the platters and the excessive head movement needed to read fragmented files); normally you should have a backup strategy that keeps your data safe (like asynchronous replication or snapshots on another array).

    • Janusz Bak / 07, 05 2012 08:13:37

      If your data is continuously protected, you use very good quality hardware, and you have a good monitoring system, then the shorter time an array spends running in degraded mode statistically speaks in favor of the hot spare. This is why the hot spare was a good selling point.
      The blog post was created to make people understand that relying ONLY on hot spare disks is not a good idea. You should always have your data backed up somewhere in a different location.

      • user / 02, 07 2013 12:56:54

        You should have made that clear in the article then, because the article implies it is better to back up a degraded disk directly before a rebuild.

  • Walter / 15, 05 2012 02:57:06

    Thanks for the info!
    Regards

  • athar13 / 05, 02 2013 04:38:05

    Being the IT manager of a group of companies in the VAS industry, I beg to differ very strongly. I guess what is bliss for one can be a blister for someone else. We do not have the luxury of taking our database down for a backup, swap and rebuild, as the 8 disks are 1TB each and a rebuild would be way too painstakingly time-consuming (approximately a day!!!). The best RAID for an active server would be RAID5 + hot spare, and for an archive server RAID10 + global spare. The RAID5 + hot spare would give you a fault tolerance of 2 disk failures and the RAID10 + global spare would give a fault tolerance of 5 disk failures (provided 1 disk of each group fails).

  • Christophe / 08, 03 2013 07:48:48

    I do agree with the risk of one extra disk failing while rebuilding, however I do not agree at all with your procedure.

    First of all, taking a backup at that point is not a good idea:
    – You are putting stress on all your disks as well.
    – You need additional storage for your “full backup”, which may not be available depending on your RAID size. I must admit I don’t have a spare machine with a spare 50TB (used data).
    – Your backup strategy should be designed beforehand and backups should already exist, independently of the state of your RAID. A RAID system IS a single point of failure in the case of a power surge, so you should consider that it could die any day.

    From our experience, shutting down/booting a system also poses a very big risk for disks. So if you're already in a degraded state, you don't want to take that additional risk.

    Your system should support hot-swappable disks, which is the case for any SATA these days.
    The goal of a RAID is to have high availability so you don’t really want to bring your system down for a simple disk-swap.

    Hoping that this experience gives another view.

