Updated on 03/08/2021

Due, in part, to the different views and opinions regarding the usage of hot spare disks in our previous post, we’ve decided to add an update for clarification. 

The Problematic Aspects of Using a Hot Spare Disk

In theory, using a hot spare disk with ZFS, Solaris FMA, or any other data storage environment is a good solution: it responds automatically to disk malfunctions in a Redundant Array of Independent Disks (RAID) and helps minimize the time the array spends in a degraded state.

That being said, the primary goal of creating a RAID is to ensure continuous operation and prevent data loss in the event of a disk failure. Therefore, anything that increases the risk of data loss could be considered a bad idea. Let’s take a closer look at some of the problematic aspects of using hot spare disks.

Hot Spare Disks Add Stress to Vulnerable Systems

The primary issue with hot spare disks is that they enable the rebuilding (resilvering) of a system that is still in active use as a production server. While the resilvering process is taking place, the system will also continue to process the usual production data reads and writes. 

Resilvering consumes significant server resources. When it runs while the server is still in use, it must compete with production workloads. Because it is treated as a low-priority task, resilvering can take a long time, sometimes even several weeks. Throughout that period the disks serve both production and resilver I/O at or near their maximum throughput, which puts considerable strain on them, especially HDDs, and may lead to serious wear or further failures.
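To see why the duration matters, here is a rough back-of-the-envelope sketch. It assumes a simple model in which resilver time is the amount of data to copy divided by a steady throughput. The 10 TB size and both throughput figures are illustrative assumptions, not measurements from any particular system.

```python
def resilver_hours(data_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to copy data_bytes at a steady throughput (simplified model)."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return data_bytes / throughput_bytes_per_s / 3600


TB = 10**12  # decimal terabyte
MB = 10**6   # decimal megabyte

# Assumed figures for illustration only:
# an otherwise idle array versus one where production I/O leaves
# only a small share of disk bandwidth to the resilver.
idle_hours = resilver_hours(10 * TB, 150 * MB)
busy_hours = resilver_hours(10 * TB, 5 * MB)

print(f"idle: {idle_hours:.1f} h, busy: {busy_hours / 24:.1f} days")
```

Under these assumed numbers the same rebuild stretches from under a day to more than three weeks, all of it spent in a degraded state.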

From decades of experience, we’ve learned that using hot spare disks in complex enterprise systems increases the probability of additional disks failing, as the resilvering process puts more and more stress on the remaining disks and on the system itself.

Problems in Overall Hot Spare Disk Design

The next flaw of a hot spare disk is that it degrades over time. From the moment it is connected to the system, it keeps on working. When it is eventually needed as a replacement for a damaged disk, the hot spare itself may no longer be in good enough condition to take over.

Another issue with hot spare disks is that they are activated automatically when a disk failure is detected, even if the failed disk is still connected to the system. The faulty disk might attempt to reconnect and operate again while the hot spare is taking over its role, creating additional stress on the system. This can impact overall performance and, in some cases, increase the risk of data loss.

Hot Spare Disks Create a Single Point of Failure

If you’re looking to create a system with no single point of failure, a hot spare disk will not provide you with much confidence given that the process of automatically replacing a failed disk has been known to occasionally fail, either partially or fully, and result in data loss. 

Having spent decades providing customers with data storage solutions, we’ve heard of many cases where a hot spare disk caused an entire server to fail and even led to data loss. Automation is risky here because it can start a domino effect, especially when the data storage infrastructure has been running for years and the hardware is worn out.

Our Solution

Because of these problems, our advice is not to rely on hot spare disks in complex data storage architectures and to use other business continuity solutions instead, such as High Availability (HA) clusters, backups and On- & Off-site Data Protection (ideally all of them).

With the ZFS file system it’s much easier to monitor the system and create a proper backup, which gives you the ability to retrieve data from a damaged disk and write it onto a new one. In addition, when using an HA cluster, you can manually switch production from the affected node to the second one so that you can perform maintenance on the affected node.
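As a minimal sketch of what such monitoring could look like, the snippet below scans text in the general shape of `zpool status` output and lists every entry whose state is not ONLINE. The sample text, the pool name `tank`, the device names, and the helper `unhealthy_devices` are all hypothetical, and real output differs between ZFS versions, so treat this as an illustration rather than a robust parser.

```python
# Hypothetical sample, shaped like typical `zpool status` output.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

    NAME        STATE     READ WRITE CKSUM
    tank        DEGRADED     0     0     0
      raidz1-0  DEGRADED     0     0     0
        sda     ONLINE       0     0     0
        sdb     FAULTED      0     0     0
        sdc     ONLINE       0     0     0

errors: No known data errors
"""

# Device states this sketch recognises; anything else is ignored.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def unhealthy_devices(status_text):
    """Return (name, state) pairs for config entries that are not ONLINE."""
    found = []
    in_config = False
    for line in status_text.splitlines():
        parts = line.split()
        if parts[:2] == ["NAME", "STATE"]:
            in_config = True  # the device table starts after this header
            continue
        if not in_config or len(parts) < 2 or parts[1] not in STATES:
            continue
        if parts[1] != "ONLINE":
            found.append((parts[0], parts[1]))
    return found


print(unhealthy_devices(SAMPLE))
```

On a live system the text would come from running `zpool status` rather than from a constant, and an alert on a non-empty result would be the cue to start the manual procedure.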

We’d advise following this procedure once the array shows that a degraded state has occurred as a result of a disk failure:

  1. Move resources to the second node in your HA cluster if possible.
  2. Run a full data backup.
  3. Verify the backed-up data for consistency, and verify whether the data restore mechanism works.
  4. Identify the problem source, i.e., find the faulty hard disk. If possible, shut down the server and make sure the serial number of the hard disk matches the one reported by the event viewer or system logs.
  5. Replace the hard disk identified as bad with a new, unused one. If the replacement hard disk has already been used in another RAID array, make sure that any residual RAID metadata on it has been deleted via the original RAID controller.
  6. Start a rebuild of the system.
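The ordering of the steps above is the whole point, so here is a small sketch of a checklist that refuses to advance unless the previous step has been confirmed. The class `ReplacementChecklist` and its method names are hypothetical, and the sketch assumes every step is mandatory (in practice step 1 only applies if you have an HA cluster).

```python
STEPS = [
    "Move resources to the second HA node",
    "Run a full data backup",
    "Verify the backup and the restore mechanism",
    "Identify the failed disk by serial number",
    "Replace the failed disk with a new, unused one",
    "Start the rebuild",
]


class ReplacementChecklist:
    """Tracks progress and enforces that steps are confirmed in order."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.done = 0  # number of steps confirmed so far

    def confirm(self, step_number):
        """Confirm a 1-based step; raise if an earlier step was skipped."""
        if step_number != self.done + 1:
            raise RuntimeError(
                f"step {step_number} attempted before step {self.done + 1}"
            )
        self.done = step_number
        return self.steps[step_number - 1]


# Manual procedure: steps 1 to 4 are confirmed before any disk is touched.
manual = ReplacementChecklist(STEPS)
for n in range(1, 5):
    manual.confirm(n)

# Jumping straight to step 5 is rejected by the checklist.
automatic = ReplacementChecklist(STEPS)
try:
    automatic.confirm(5)
except RuntimeError as err:
    print(err)
```

The second case mirrors what an automatic replacement effectively does: it goes straight to the disk swap and rebuild without the backup and verification steps having been confirmed.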

So, with this approach, the rebuild consists of 6 steps. With a hot spare disk, your RAID skips the first four steps and automatically runs steps 5 and 6. The rebuild therefore starts before you have had a chance to carry out those critical steps, which could be the difference between your data being safe and being lost.

Anyway, it’s still completely up to you as to how to build a proper system. However, we’d suggest not relying on hot spare disks in a ZFS RAID array due to the potential data loss it can cause. 

45 Comments

  • BoonHong /

    24, 08 2011 07:04:33

    Another hard disk may fail during the backup, making the whole array unusable as well. Ideally we should have RAID 6, which tolerates 2 hard disk failures. Having 3 hard disks fail within the rebuild window is highly unlikely.

    Moreover, most users can’t afford to have a double set of storage for doing a full backup that already contains a previous full backup.

  • peter /

    24, 08 2011 11:46:45

    Hot Spare a bad idea?
    Depending on the type and size of the RAID array it can also be a very good idea.
    If for example you have a mirrored configuration where the primary array gets critical, you don’t need to run a full backup.
    If for example you have a RAID 6 configuration there is still no critical need for a backup when one disk fails.
    If for example you have a RAID 50 array the time to make a full backup can exceed the time permitted to be in a critical situation.
    So whether a hot spare is a good idea depends on the configuration used, the type of array chosen and the SLAs.

  • Matthias /

    07, 09 2011 05:13:35

    At this time we are evaluating a new iSCSI solution for a customer. The performance of open-e is very good. When the tests have finished and our customer is happy with this solution, the system goes to Africa. In Africa we have nobody who’s able to do an exchange like this when a disk has crashed. But they are at least able to change a hard disk… The RAID controllers we build into these systems are able to make a RAID 5EE. I haven’t tested the performance yet, but I think that the stress on the system is lower than with a complete rebuild onto a hot spare disk.
    Sorry about my english, it’s not the best….

  • Kai-Uwe /

    07, 09 2011 05:32:42

    I strongly disagree. Any IT admin keeping data on ANY kind of disk be it a simple disk or a RAID or a complex SAN storage subsystem should ALWAYS have a complete backup which is AT MOST one work day old!

    So if you start with a backup only after a disk has failed, you have definitely not done your job right before.

    I personally tend to use RAID-6 nowadays as an additional disk will not cost much and will leave room for an additional failure. Sometimes I use RAID-6 without a hot spare in addition to it but sometimes (if a drive slot and the money for the extra drive do not matter) I even add a hot spare to a RAID-6, too.

    Also, modern storage systems use “background scrubbing” to detect bad sectors in advance so that you are not hit by one in the event of a rebuild. In addition, this causes kind of a “healthy” stress on all disks to sort out the flaky ones rather soon …

