Hardware failure is not an exception; it is the norm. I guess most of you would agree with this statement.
A real case we experienced a few weeks ago is worth describing on our blog as a reminder of this norm. The norm is most obvious when we consider hard disk failure.
Hard disks are usually the number one cause of failure events. This is why all of us use some kind of RAID to minimize the negative impact of a hard disk failure.
In theory, when a hard disk that is a RAID member fails, the RAID simply rejects the failed drive and rebuilds onto a new one.
This is often the case in practice, but unfortunately not always. In some cases a drive failure can cause the RAID array to report I/O errors to the operating system, or even worse: the whole server may stop working. Exactly this problem happened to one of our servers. On a nice Friday morning we were faced with a very frustrating problem: the server was not available at all and did not even respond to ping. ☹
So we had to hard power the system off and on again, and we saw that the RAID was in degraded mode and the OS did not boot. Luckily, after the next power cycle the OS booted. Immediately after the boot we made a fresh backup of our SQL application database. Since the server was running and the application was working properly, we decided to leave the RAID rebuild until after hours.
Unfortunately, after a few hours the server hung again. This time ping still answered, but the SQL application and the console did not react at all. After powering off, we removed the failed hard disk and started the server again. Now it booted without a problem on the degraded RAID array.
So a faulty hard disk that was still present in the RAID was able to hang the whole system. I am not going to name the hardware vendors of the server, RAID and hard disks, as it would not help to avoid such a problem: in my experience it can happen with any vendor.
We have addressed this problem in our iSCSI and NAS (NFS) Failover solution. In case of any I/O errors, the storage fails over transparently for applications.
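To illustrate why a ping check alone would not have caught our second hang, here is a minimal sketch of a storage health probe. It is not how our Failover solution is implemented; the function names `probe_storage` and `should_fail_over`, the 5-second timeout and the three-failure threshold are all assumptions made up for this example. The idea is simply to attempt a small write with a deadline and to treat both an I/O error and a hang as a failure.

```python
import os
import tempfile
import threading


def probe_storage(directory, timeout_s=5.0):
    """Return True if a small write + fsync in `directory` completes in time.

    Hypothetical helper for illustration only.
    """
    outcome = {"ok": False}

    def write_probe():
        try:
            fd, path = tempfile.mkstemp(prefix="probe-", dir=directory)
            try:
                os.write(fd, b"probe")
                os.fsync(fd)  # force the write down to the device
            finally:
                os.close(fd)
                os.unlink(path)
            outcome["ok"] = True
        except OSError:
            # I/O errors (and a missing path) surface here
            outcome["ok"] = False

    # Run the write in a daemon thread so a hanging disk cannot block the caller
    worker = threading.Thread(target=write_probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # A worker that is still alive after the deadline means the I/O is hanging
    return outcome["ok"] and not worker.is_alive()


def should_fail_over(recent_probes, max_failures=3):
    """True when the last `max_failures` probe results all failed (assumed policy)."""
    tail = recent_probes[-max_failures:]
    return len(tail) == max_failures and not any(tail)
```

A monitoring loop could call `probe_storage` on the data volume every few seconds, append the result to a list, and start a failover once `should_fail_over` returns True. Requiring several consecutive failures is a design choice that avoids failing over on a single slow write.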
If you need real business continuity, please consider HA cluster systems, protect your data with a professional backup plan, and remember: “Hardware failure is not an exception; it is the norm.”
1 Comment
Toni B. / 04, 06 2012 08:36:38
And Friday is the norm, too! Usually at 3 PM.
Can this particular problem appear with any type of RAID (10, 5, 6)?