Updated on 03/08/2021
Due, in part, to the differing views and opinions on the use of hot spare disks voiced in response to our previous post, we’ve decided to add an update for clarification.
The Problematic Aspects of Using a Hot Spare Disk
In theory, using a hot spare disk with ZFS, Solaris FMA, or any other data storage environment is a good solution: it reacts automatically to damage in a Redundant Array of Independent Disks (RAID) array, and it does help minimize the time the array spends in a degraded state.
That being said, our goal in creating a RAID array is to keep operating, without losing data, in the event of a disk failure. Anything that increases the risk of data loss works against that goal. Let’s have a look at some of the problematic aspects of hot spare disks.
Hot Spare Disks Add Stress to Vulnerable Systems
The main problem with hot spare disks is that they trigger a rebuild (resilver) on a system that is still actively being used as a production server. This means that, while the resilvering process is taking place, the system is also still occupied with the usual production reads and writes.
Resilvering is a resource-intensive process, so when it is executed while the server is still in use, it has to compete with the production load. Because it runs as a low-priority task, the resilver can take a very long time to complete (even up to a few weeks). The result is a server working at its maximum achievable throughput for weeks, which can have dire consequences for the disks (especially HDDs).
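To make the duration of that load visible, here is a minimal sketch of ours (not part of any shipped tooling) that polls `zpool status` and prints the resilver progress line. The pool name `tank` and the polling interval are assumptions, and the exact output format varies between ZFS versions.

```python
# Minimal sketch, assuming an OpenZFS system and a pool named "tank" (hypothetical):
# poll `zpool status` and print the resilver progress line so an admin can see how
# long the rebuild is dragging on while the pool also serves production I/O.
import re
import subprocess
import time

POOL = "tank"        # hypothetical pool name
INTERVAL_S = 600     # check every 10 minutes (arbitrary choice)

def resilver_progress(pool):
    """Return the '% done' line of an in-flight resilver, or None if none is running."""
    status = subprocess.run(["zpool", "status", pool],
                            capture_output=True, text=True, check=True).stdout
    if "resilver in progress" not in status:
        return None
    # zpool prints a progress line such as "..., 45.67% done, 0 days 07:12:34 to go"
    match = re.search(r"^\s*(.*% done.*)$", status, flags=re.MULTILINE)
    return match.group(1).strip() if match else "resilver in progress"

if __name__ == "__main__":
    while True:
        progress = resilver_progress(POOL)
        if progress is None:
            print("no resilver running")
            break
        print(progress)
        time.sleep(INTERVAL_S)
```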
With decades of experience behind us, we’ve seen that the use of hot spare disks in complex enterprise systems increases the probability of additional disk failures, as the resilvering process puts more and more stress on the remaining disks and on the system itself.
Problems in Overall Hot Spare Disk Design
The next flaw of a hot spare disk is that it degrades over time. From the moment it is connected to the system, it keeps on working, and when the time eventually comes for it to replace a damaged disk, the hot spare itself may simply not be in good enough condition to do so.
Another problematic aspect of hot spare disks is that they take over automatically as soon as a disk failure is detected, while the corrupted disk may still be connected to the system. That disk can keep trying to reconnect and resume working even as the hot spare is taking over its role, adding yet more stress to the system. This is another factor that can affect the system’s overall performance and could potentially lead to data loss.
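As a rough illustration of how you might at least notice that such an automatic takeover has happened, the sketch below (ours, not from the article) parses the spares section of `zpool status`. The pool name `tank` is an assumption, and the section layout can differ between ZFS versions.

```python
# Minimal sketch, assuming an OpenZFS pool named "tank" (hypothetical): report which
# configured hot spares zpool lists as INUSE, so the operator notices the automatic
# takeover and can check whether the original, possibly flapping, disk is still attached.
import subprocess

POOL = "tank"  # hypothetical pool name

def spares_in_use(pool):
    """Return names of spare devices that `zpool status` reports as INUSE."""
    status = subprocess.run(["zpool", "status", pool],
                            capture_output=True, text=True, check=True).stdout
    in_use, in_spares = [], False
    for line in status.splitlines():
        stripped = line.strip()
        if stripped.startswith("spares"):
            in_spares = True          # the spares section follows this header
            continue
        if in_spares:
            if not stripped or stripped.startswith("errors:"):
                break                 # end of the spares section
            fields = stripped.split()
            if len(fields) >= 2 and fields[1] == "INUSE":
                in_use.append(fields[0])
    return in_use

if __name__ == "__main__":
    active = spares_in_use(POOL)
    print(("hot spare(s) in use: " + ", ".join(active)) if active else "no hot spare in use")
```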
Hot Spare Disks Create a Single Point of Failure
If you’re looking to create a system with no single point of failure, a hot spare disk will not provide you with much confidence given that the process of automatically replacing a failed disk has been known to occasionally fail, either partially or fully, and result in data loss.
Having spent decades providing customers with data storage solutions, we’ve seen plenty of cases where a hot spare disk was the cause of an entire server failure, and even of data loss. Automation here is risky because it can set off a domino effect, especially when the data storage infrastructure has been running for years and the hardware is worn out.
Our Solution
These problematic aspects of hot spare disks are why we advise against relying on them in complex data storage architectures and recommend other business continuity solutions instead, such as High Availability (HA) clusters, backups, and On- & Off-site Data Protection (ideally all of the above).
With the ZFS file system, it’s much easier to monitor the system and create a proper backup, which gives you the ability to retrieve data from a damaged disk and write it onto a new one. In addition, when using an HA cluster, you can manually switch production from the affected node to the second one so that you can perform maintenance on the affected node.
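For the backup side, a minimal sketch of the ZFS snapshot-and-send approach might look like the following. The dataset names `tank/data` and `backup/data` and the snapshot name are assumptions for illustration only.

```python
# Minimal sketch, assuming an OpenZFS setup: snapshot a dataset on the degraded pool
# and stream it into a separate backup pool before touching the failed disk.
# Dataset names ("tank/data", "backup/data") and the snapshot name are hypothetical.
import subprocess

SOURCE = "tank/data"      # hypothetical dataset on the degraded pool
TARGET = "backup/data"    # hypothetical dataset on a separate backup pool
SNAP = SOURCE + "@pre-replace"

# 1. Take a point-in-time snapshot of the dataset.
subprocess.run(["zfs", "snapshot", SNAP], check=True)

# 2. Stream the snapshot into the backup pool, mirroring
#    `zfs send tank/data@pre-replace | zfs receive -F backup/data`.
send = subprocess.Popen(["zfs", "send", SNAP], stdout=subprocess.PIPE)
subprocess.run(["zfs", "receive", "-F", TARGET], stdin=send.stdout, check=True)
send.stdout.close()
if send.wait() != 0:
    raise RuntimeError("zfs send failed")
```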
We’d advise following this procedure once the array reports a degraded state as a result of a disk failure:
- Move resources to the second node in your HA cluster if possible.
- Run a full data backup.
- Verify the backed-up data for consistency, and verify whether the data restore mechanism works.
- Identify the problem source, i.e., find the faulty hard disk. If possible, shut down the server and make sure the serial number of the hard disk matches the one reported by the event viewer or system logs.
- Replace the hard disk identified as bad with a new, unused one. If the replacement hard disk has already been used within another RAID array, make sure that any residual RAID metadata on it has been deleted via the original RAID controller.
- Start a rebuild of the system (for a ZFS pool, see the sketch below).
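For a ZFS pool specifically, steps 5 and 6 might look like the sketch below. The pool name and device paths are assumptions, and `zpool labelclear` only removes leftover ZFS labels; hardware RAID metadata still has to be cleared by the original controller, as noted above.

```python
# Minimal sketch of steps 5-6 on an OpenZFS pool: wipe any leftover ZFS label from the
# replacement disk, then start the resilver manually with `zpool replace`.
# The pool name and device paths are hypothetical.
import subprocess

POOL = "tank"                                    # hypothetical pool name
FAILED_DISK = "/dev/disk/by-id/ata-OLD_DISK"     # hypothetical failed device
NEW_DISK = "/dev/disk/by-id/ata-NEW_DISK"        # hypothetical replacement device

# Remove stale ZFS labels if the replacement disk was ever part of another pool.
# (-f forces the clear; double-check the device path before running this.)
subprocess.run(["zpool", "labelclear", "-f", NEW_DISK], check=True)

# Swap the failed device for the new one; ZFS starts resilvering onto NEW_DISK.
subprocess.run(["zpool", "replace", POOL, FAILED_DISK, NEW_DISK], check=True)

# The resilver runs in the background; `zpool status` shows its progress.
print(subprocess.run(["zpool", "status", POOL],
                     capture_output=True, text=True, check=True).stdout)
```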
So, using this approach, the rebuild consists of six steps. With a hot spare disk, your RAID array skips the first four significant steps and automatically runs steps 5 and 6. The rebuild is thus completed before you can perform those other critical steps, steps that could be the difference between your data being safe and being lost.
In the end, how you build a proper system is entirely up to you. However, we’d suggest not relying on hot spare disks in a ZFS RAID array because of the potential data loss they can cause.
Comments
matthew / 10/08/2016 10:58:59
Let me posit some reasons why a hot spare is a *really* good idea.
If a RAID array fails while no-one is at work (does your site remain manned 24×365, even in the event of a fire alarm or other security alert?), you’re running at risk of data loss until the situation is addressed.
SMART monitoring on the controller spots that the disk is failing *before* it goes offline and fails over to the hot spare with no risk of data loss and no downtime. I know it won’t catch all failures, but on the HP kit I’ve worked on, the majority of failures (media failures as opposed to electronics failures) have been spotted and fixed on the fly while the disk was still usable.
In over 20 years I have *once* had an array fail a second disk while recovering onto a hot spare. I have lost count of the number of times that a hot spare has saved the day…
I have worked on one customer’s system where the transactional traffic was so high that they would not restore from backup… the lost income from the restore time outweighed the costs of abandoning the data and starting again with an empty database… for them, hot spares provided a better financial risk than an outage to perform steps (1) and (2)
If your data is that critical, you should use some RAID that allows multiple devices to fail… and probably have more than one hot spare available.
So I’m afraid my real-world experience teaches me that your article is really not a good way to go for every business. There may be edge cases where a hot spare proves bad, but there are edge cases where not wearing a seatbelt in a car proved beneficial.
Luke / 11/10/2016 04:21:30
I also think a hot spare is a bad idea with RAID 5. Here is my reasoning…
Option A: RAID 5 without a hot spare (3 drives in total)
Option B: RAID 5 with a hot spare (4 drives in total)
In Option A: if one drive fails, you simply replace that drive ASAP and then the system rebuilds onto the replacement. Let’s say the whole rebuild takes 24hrs.
In Option B: if one drive fails, a rebuild immediately begins onto the hot spare. It will also take 24hrs. Once the rebuild is complete, you still need to swap out the broken drive, which will trigger another rebuild, this time from the drive that was the hot spare to the freshly inserted drive – this process will also take 24hrs.
So not only did you do the rebuild twice, it also took twice as much time, effectively doubling the amount of time your RAID array is in danger of losing another drive. With all the extra stress caused by doing the rebuild twice, it’s really walking on thin ice.
It almost feels like it’s better to have a spare drive ready, kept on a shelf, not as part of the system (and not assigned as a hot spare). As soon as a faulty drive is detected, start a rebuild onto that drive. Obviously, if you cannot be next to your system every day, then maybe a hot spare is a better option.
If you really insist on RAID 5, then maybe not having a hot spare is the safer option in this case. Unless I am missing something really obvious here.
Would love more feedback on this case.
** I know that in some cases you can mark the freshly rebuilt hot spare as your new drive and then simply add another drive as a hot spare. But I am not sure if this is the default behavior for RAID controllers. I think most RAID controllers usually just rebuild back from the hot spare onto a freshly slotted drive.
MJI / 01/11/2016 01:09:37
I don’t understand why Option B would trigger two rebuilds. If you have a hot spare, the NAS will start an automatic rebuild, at the end of which you have a restored 3-drive setup. When you replace the broken drive, you can add the new drive as a hot spare if you want; nothing forces you to turn it into a 4-drive NAS.
Scott in Texas / 28/12/2016 01:45:46
If you use real RAID, not motherboard RAID or software RAID, you need not move the drive, and if you DID move the drive from the hot-swap port to the primary port, it would not result in a rebuild – that is called “Disk Roaming”.
Bill / 13/11/2016 09:25:40
I agree with many of the comments here. Not having a spare drive, as a POLICY, is retarded.
If you can’t afford a spare drive straight away, then go without one for a while. But add one later!
A company that actually values its data will have an automated backup mechanism in place that matches the required RPO and RTO as defined by the IT strategy – and signed off by management/executive levels.
If a disk fails in a RAID array, replace it IMMEDIATELY. Having a hot-spare accomplishes this for you.
And since you have a backup strategy in place that already satisfies the RPO/RTO of the organization, there is no need to take another backup from a DEGRADED array. If the array is a parity RAID (heaven forbid you are using this on your primary storage) then the performance will be lower than normal and you are leaving your array in this state for longer – further increasing the likelihood that another drive will fail before the rebuild is done.
If you don’t have a backup of your primary data that meets the organization’s RPO/RTO, or your organization hasn’t even thought about these things, then obviously data integrity doesn’t matter and you basically just have a load of junk on disk – so why bother with the backup if the data is just junk anyway?
Oh, and don’t forget that as well as a proper backup mechanism, you also need monitoring/notifications of the array status – so you KNOW that a disk has failed and can organize the replacement immediately.
Scott in Texas / 28/12/2016 01:35:26
Struggling with your approach. I agree with the comments above about how a failure during a write of ALL your data to a backup set is just as risky as going straight to a rebuild. I also think it moronic of you to recommend data validation AFTER you have had a failure. I run validation EVERY NIGHT for a couple of hours a night, resulting in a full validation every week. Does it stress the drives? Yes, it does, and if any drives get flaky, S.M.A.R.T. will identify them and warn me… more importantly, I would rather validate the data BEFORE it becomes critical that it is validated, and risky to perform the validation.
I also run RAID 6, so that in the event of a failure, it would take two additional failures to lose data. So no, a 14-drive RAID 6 array is not truly a “backup”, but it is damned close… not to mention the problems backing up a 22TB array would present.
FWIW, I also run a REAL RAID controller card (3Ware 9650SE), not motherboard RAID and not software RAID… so yes, I sleep quite well at night having a hot-swap drive and skipping your first two steps.