Updated 11/04/2022
I guess we all know that caching data usually improves performance, but I wonder whether we are all aware of the risks that caching introduces and how to minimize them. That is why I decided to write this short article.
Let us analyze a situation in which we use a device (HDD) connected to our OS via iSCSI. The server that provides this device as a logical unit (LU) through iSCSI has a RAID controller. The LU, which appears in our OS as a device, has been formatted with NTFS. A simple communication scheme therefore looks like this:
client OS (device formatted with NTFS) <-> iSCSI initiator <-> iSCSI target <-> RAID controller.
Now we can take a closer look at the RAID controller configuration. Most hardware RAID controllers provide caching. Please keep in mind that we are analyzing a configuration where volatile memory (RAM) is used for the cache. This functionality goes by several names: Write Back (WB) cache, Unit Write Cache, or just Cache. Unfortunately, a lot of RAID controllers offer this function even if they do not have a BBU (Battery Backup Unit). Why is a BBU necessary when the cache is used at the RAID controller level? Let's see: the OS writes the data to the device connected through iSCSI and waits for confirmation that the operation has finished successfully. The iSCSI initiator sends the data for the LUN to the iSCSI target, which sends the data to the RAID device. This is the critical point, because the iSCSI target gets confirmation from the RAID controller that the data has been written successfully and sends this information back to the iSCSI initiator, which passes it back to the OS.
However, in this case, the data is not yet on the disk drives connected to the RAID controller but only in the cache. If the power supply fails at this very moment, we lose the data. To minimize the risk of losing data in the described case, it is necessary to use a UPS for the whole server, and best of all to also use a RAID controller with a BBU, which keeps the cache contents alive during an outage so that they can be written to the disks once power returns. This maximizes data protection without sacrificing performance.
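The effect described above can be illustrated with a toy model. This is only a sketch: the `RaidController` class, its methods, and the `lost_after_power_failure` helper are hypothetical names invented for this illustration, not a real controller API, and the BBU is simplified to "the cache survives the outage and is flushed afterwards".

```python
class RaidController:
    """Toy model of a RAID controller with a volatile (RAM) cache."""

    def __init__(self, write_back: bool, has_bbu: bool):
        self.write_back = write_back
        self.has_bbu = has_bbu
        self.cache = {}  # volatile RAM: block number -> data
        self.disks = {}  # persistent storage: block number -> data

    def write(self, block: int, data: bytes) -> bool:
        """Accept a write and return True as the acknowledgement."""
        if self.write_back:
            # Write Back: acknowledged while the data is only in RAM.
            self.cache[block] = data
        else:
            # Write Through: acknowledged only after reaching the disks.
            self.disks[block] = data
        return True

    def flush(self) -> None:
        """Move everything from the cache to the disks."""
        self.disks.update(self.cache)
        self.cache.clear()

    def power_failure(self) -> None:
        """Simulate an outage followed by a restart."""
        if self.has_bbu:
            # Simplification: the battery preserves the cache,
            # which is flushed once power is back.
            self.flush()
        else:
            # Without a BBU the volatile cache is simply gone.
            self.cache.clear()


def lost_after_power_failure(write_back: bool, has_bbu: bool) -> list:
    """Return the acknowledged blocks that are missing after an outage."""
    controller = RaidController(write_back, has_bbu)
    acknowledged = {}
    for block in range(4):
        data = f"block-{block}".encode()
        if controller.write(block, data):
            acknowledged[block] = data
    controller.power_failure()
    return sorted(
        block
        for block, data in acknowledged.items()
        if controller.disks.get(block) != data
    )
```

In this model, `lost_after_power_failure(True, False)` reports every acknowledged block as lost, while the same call with a BBU, or with Write Back disabled, reports none.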
The second level in the described scenario where a cache could be used, and could introduce some risk of data loss, is the configuration of the LU in the iSCSI target. As with the RAID configuration, we can enable WB in the LU configuration while adding it to the iSCSI target or, put another way, turn off Write Through (WT, the opposite of WB). The write sequence and the wait for confirmation are similar but shorter: the OS writes the data to the device connected through iSCSI and waits for confirmation that the operation has finished successfully. The iSCSI initiator sends the data for the LUN to the iSCSI target, which immediately sends back a confirmation to the iSCSI initiator that the write operation has finished successfully. In this case, only a UPS can minimize the risk of data loss.
Of course, I will not describe the possible combinations of redundant power supplies and UPS units, because that is not the goal of this article.
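To make the two cache levels easier to compare, here is a small sketch that walks the write path and reports which layer acknowledges a write first and which layers are left exposed for a given set of installed protections. The `Layer` dataclass and the functions `build_path`, `acknowledging_layer` and `exposed_layers` are hypothetical helpers written for this article; the mapping of protections to layers simply encodes the statements above (a UPS or BBU helps the RAID controller cache, only a UPS helps the LU cache in the iSCSI target).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    write_back: bool
    helped_by: frozenset  # protections that cover this layer's cache


def build_path(target_wb: bool, raid_wb: bool) -> list:
    """Server-side write path, ordered from the client towards the disks."""
    return [
        Layer("iSCSI target", target_wb, frozenset({"UPS"})),
        Layer("RAID controller", raid_wb, frozenset({"UPS", "BBU"})),
    ]


def acknowledging_layer(path: list) -> str:
    """The first Write Back layer confirms the write before passing it on."""
    for layer in path:
        if layer.write_back:
            return layer.name
    return "disks"  # all layers are Write Through


def exposed_layers(path: list, installed: set) -> list:
    """Write Back layers that none of the installed protections covers."""
    return [
        layer.name
        for layer in path
        if layer.write_back and not (layer.helped_by & set(installed))
    ]


if __name__ == "__main__":
    path = build_path(target_wb=True, raid_wb=True)
    print("Acknowledged by:", acknowledging_layer(path))
    print("Exposed with BBU only:", exposed_layers(path, {"BBU"}))
    print("Exposed with UPS:", exposed_layers(path, {"UPS"}))
```

With WB enabled at both levels, the model shows that a BBU alone still leaves the iSCSI target cache exposed, which is why a UPS is needed in that configuration.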
Let's look at the OS device formatted with NTFS and connected to this system through an iSCSI initiator. A few times, customers have reported the following problem: they wrote data to the device, then created a snapshot on the server and backed up that snapshot to tape. A few months later, they could not find the changed data on the tapes! This is because NTFS, like other filesystems, uses a cache that is flushed to disk every few seconds. So we have at least two solutions here. First, you can wait a few seconds before making a backup of the LU on the server side. The second option is to use software that flushes the NTFS cache to the device on demand, which you can find here. If we are using a Linux/UNIX OS and other filesystems in an iSCSI environment similar to the one described above, we can use the system utility sync to get the same result.
The conclusion of this article: always analyze the potential risks of data loss and minimize them as much as possible by using an alternative power source, and always make sure that important data that must be backed up is consistent. Good luck!
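If the data is written by your own script, the client-side flush can also be requested from code before the snapshot is taken. The sketch below uses only Python's standard library; the function names `write_durably` and `flush_all_filesystems` are hypothetical, chosen for this example. Note the assumption: this covers only the filesystem cache on the client, not the Write Back caches in the iSCSI target or the RAID controller discussed earlier.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it out of the filesystem cache."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # empty Python's own userspace buffer
        os.fsync(f.fileno())  # ask the OS to flush this file to the device


def flush_all_filesystems() -> bool:
    """Flush all filesystem caches, like the sync utility does.

    os.sync() is available only on Unix-like systems, so this returns
    False when it cannot be called (for example on Windows).
    """
    if hasattr(os, "sync"):
        os.sync()
        return True
    return False


if __name__ == "__main__":
    # Hypothetical usage: write a file, flush, then trigger the snapshot.
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "changed-data.bin")
        write_durably(target, b"changed data")
        flush_all_filesystems()
        print("Data flushed, safe to take the snapshot now.")
```

`os.fsync` works per file on both Windows and Unix-like systems, so it is the better fit when only a few known files must be consistent before the snapshot is taken.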
This article describes the functionalities of our legacy product. Learn more about Open-E JovianDSS features on our website.