When you scroll through data storage articles and read file system specifications, you might come across the catchy term “data deduplication”. In case you haven’t checked it yet, it is a file system feature that reduces the amount of stored data by eliminating duplicates. It can work on two levels: file-level and block-level (and, in some specific cases, even bit-level), depending on the type of deduplication. File-level deduplication works with whole files, while block-level deduplication works with individual blocks. Both have their advantages: file-level deduplication takes fewer resources and can therefore be deployed over larger amounts of physical storage, while block-level deduplication can eliminate duplicate chunks smaller than a file.
Sounds good so far, right?
However, before happily enabling the feature just because it sounds desirable, you should first analyze the data you store. Let us explain how it works and what it actually offers you, using Open-E JovianDSS as an example.
Pros of Deduplication
Used Space Reduction:
As stated in the description of the term, data deduplication is a technique for eliminating repeated data, so its main goal is to reduce the space occupied in data storage. How does it work? During this process, the system divides the data into byte patterns (chunks), which are contiguous blocks of data. Each chunk is compared against the existing data to find matching patterns. When the system claims “It’s a match!”, the redundant chunk is replaced with a small reference that points to its “twin”. As long as the deduplication ratio is higher than 1:1, this effectively reduces the used capacity, in extreme cases by whole disks’ worth of data.
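The mechanism can be sketched in a few lines of Python. This is a simplified illustration using fixed-size chunks and SHA-256 hashes as references, not how Open-E JovianDSS implements deduplication internally:

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks; store each unique chunk once
    and represent repeats by a reference (here: the chunk's hash)."""
    store = {}        # hash -> unique chunk (the stored "twins")
    references = []   # ordered list of hashes that reconstructs the data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # keep only the first copy
        references.append(digest)
    return store, references

def reconstruct(store, references):
    """Rebuild the original data by following the references."""
    return b"".join(store[d] for d in references)

# Data whose blocks repeat deduplicates well: 16 KiB logical,
# but only two unique 4 KiB chunks are actually stored.
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
store, refs = deduplicate(data)
logical = len(data)
physical = sum(len(c) for c in store.values())
print(f"ratio: {logical / physical:.1f}:1")  # prints "ratio: 2.0:1"
```

Here the deduplication ratio is 2:1, so half of the physical capacity is saved while `reconstruct(store, refs)` still returns the original data.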
Thanks to the elimination of duplicates, there is also a good chance of improving I/O performance for write operations: the more data is deduplicated, the less data is left to be written, as the system records just references to the existing chunks.
Cons of Deduplication
As we’ve already mentioned, you should analyze your business requirements and the data you’d like to keep on your storage solution. If the data contains no duplicates (that is, the data chunks are unique and cannot be reduced), this feature can turn out to be useless: it will noticeably affect neither the capacity nor the performance.
In addition, in some cases deduplication affects the read performance of your system. For example, restoring from a deduplicated storage pool can be slower than restoring from a non-deduplicated one, because deduplication can spread the extents of a given file across multiple volumes on the server.
When it comes to automatic data comparison on any deduplication level, the system compares binary signatures (checksums) to decide whether the bits/files/blocks are the same. If two different pieces of data happen to produce the same signature (a hash collision), they may wrongly be treated as identical, which can lead to data loss after deduplication. The smaller the deduplication unit, the smaller the risk of data loss.
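This is why ZFS offers a verify mode for deduplication: after a checksum match, the chunks are additionally compared byte by byte before one of them is discarded. A minimal Python sketch of that safeguard (the function name is ours, for illustration):

```python
import hashlib

def is_duplicate(candidate: bytes, stored: bytes, verify: bool = True) -> bool:
    """Treat two chunks as duplicates only if their hashes match and,
    when verification is enabled, a byte-by-byte comparison confirms it.
    Without verification, a hash collision would silently discard
    unique data."""
    if hashlib.sha256(candidate).digest() != hashlib.sha256(stored).digest():
        return False
    # Hashes match; verify guards against the (rare) collision case.
    return candidate == stored if verify else True

print(is_duplicate(b"same chunk", b"same chunk"))       # prints True
print(is_duplicate(b"same chunk", b"other content"))    # prints False
```

Verification costs an extra read and comparison per match, which is the usual trade-off between safety and deduplication throughput.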
IMPORTANT: If you decide to turn deduplication off, for instance to restore the performance level, the new setting will apply only to newly written data. The old files will remain unchanged, since disabling deduplication doesn’t rehydrate the data: already deduplicated files stay deduplicated, so switching the feature off won’t by itself improve the storage performance. To undeduplicate the data, you can use the ZFS send and ZFS receive features to send it to a dataset that doesn’t use deduplication.
Types of data suitable for deduplication:
File servers with a large number of copies of the same data
Virtualized infrastructure with plenty of generic virtual machines
Email servers (the same attachments, repeatable parts of the messages, like email footers)
Media files with numerous versions of the same projects
Types of data not recommended for deduplication:
Already deduplicated data
Highly non-repeatable data, where every piece of data is unique
Before enabling the Open-E JovianDSS deduplication feature, you should judge for yourself whether it suits the characteristics of the data used in your company. From our perspective, if you still plan to use deduplication in your system, we recommend testing your data set with deduplication enabled and checking the performance before implementing it on the production server.