When you scroll through data storage articles and read file system specifications, you might come across the catchy term “data deduplication”. In case you haven’t checked it yet, it is a file system feature that reduces the amount of stored data by eliminating duplicates. It can work on two levels: file-level and block-level (and, in some specific cases, even bit-level), depending on the type of deduplication. File-level deduplication works with whole files, while block-level deduplication works with individual blocks. Both have their advantages: file-level deduplication takes fewer resources and can therefore be deployed over larger amounts of physical storage, while block-level deduplication can eliminate duplicate chunks smaller than a file.
Sounds good so far, right?
However, before happily enabling the feature just because it sounds desirable, you should first analyze the data you store. Let us explain how it works and what it actually offers you, using Open-E JovianDSS as an example.
Pros of Deduplication
Used Space Reduction:
As stated in the description of the term, data deduplication is a technique for eliminating repeated data, so its main goal is to reduce the space occupied in data storage. How does it work? During this process, the system divides the data into byte patterns (chunks), which are contiguous blocks of data. Each chunk is compared against the existing data to find matching patterns. When the system claims “It’s a match!”, the redundant chunk is replaced with a small reference that points to its “twin”. As long as the deduplication ratio is higher than 1:1, this effectively reduces the used capacity, in extreme cases by whole disks’ worth of data.
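The mechanism can be sketched in a few lines of Python. This is a simplified illustration using fixed-size chunks and SHA-256 hashes as references, not how Open-E JovianDSS implements deduplication internally:

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks; store each unique chunk once
    and represent repeats by a reference (here: the chunk's hash)."""
    store = {}        # hash -> unique chunk (the stored "twins")
    references = []   # ordered list of hashes that reconstructs the data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # keep only the first copy
        references.append(digest)
    return store, references

def reconstruct(store, references):
    """Rebuild the original data by following the references."""
    return b"".join(store[d] for d in references)

# Data whose blocks repeat deduplicates well: 16 KiB logical,
# but only two unique 4 KiB chunks are actually stored.
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
store, refs = deduplicate(data)
logical = len(data)
physical = sum(len(c) for c in store.values())
print(f"ratio: {logical / physical:.1f}:1")  # prints "ratio: 2.0:1"
```

Here the deduplication ratio is 2:1, so half of the physical capacity is saved while `reconstruct(store, refs)` still returns the original data.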
Thanks to the elimination of duplicates, there is also a good chance of improving I/O performance for write operations: the more data is deduplicated, the less data is left to be written, as the system records just references to the existing chunks.
Cons of Deduplication
As we’ve already mentioned, you should analyze your business requirements and the data you’d like to keep on your storage solution. If the data contains no duplicates (that is, the data chunks are unique and cannot be reduced), this feature can turn out to be useless: it will noticeably affect neither the capacity nor the performance.
In addition, in some cases deduplication affects the read performance of your system. For example, restoring from a deduplicated storage pool can be slower than restoring from a non-deduplicated one, because deduplication can spread the extents of a given file across multiple volumes on the server.
When it comes to automatic data comparison on any deduplication level, the system compares binary signatures (checksums) to decide whether the bits/files/blocks are the same. If two different pieces of data happen to produce the same signature (a hash collision), they may wrongly be treated as identical, which can lead to data loss after deduplication. The smaller the deduplication unit, the smaller the risk of data loss.
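This is why ZFS offers a verify mode for deduplication: after a checksum match, the chunks are additionally compared byte by byte before one of them is discarded. A minimal Python sketch of that safeguard (the function name is ours, for illustration):

```python
import hashlib

def is_duplicate(candidate: bytes, stored: bytes, verify: bool = True) -> bool:
    """Treat two chunks as duplicates only if their hashes match and,
    when verification is enabled, a byte-by-byte comparison confirms it.
    Without verification, a hash collision would silently discard
    unique data."""
    if hashlib.sha256(candidate).digest() != hashlib.sha256(stored).digest():
        return False
    # Hashes match; verify guards against the (rare) collision case.
    return candidate == stored if verify else True

print(is_duplicate(b"same chunk", b"same chunk"))       # prints True
print(is_duplicate(b"same chunk", b"other content"))    # prints False
```

Verification costs an extra read and comparison per match, which is the usual trade-off between safety and deduplication throughput.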
IMPORTANT: If you decide to turn deduplication off, for instance to restore the performance level, the new setting will apply only to newly written data. The old files will remain unchanged, since disabling deduplication doesn’t rehydrate the data: already deduplicated files stay deduplicated, so switching the feature off won’t by itself improve the storage performance. To undeduplicate the data, you can use the ZFS send and ZFS receive features to send it to a dataset that doesn’t use deduplication.
Types of data suitable for deduplication:
File servers with a large number of copies of the same data
Virtualized infrastructure with plenty of generic virtual machines
Email servers (the same attachments, repeatable parts of the messages, like email footers)
Media files with numerous versions of the same projects
Types of data not recommended for deduplication:
Already deduplicated data
Highly non-repeatable data, where every piece of data is unique
Before enabling the Open-E JovianDSS deduplication feature, you should judge for yourself whether it suits the characteristics of the data used in your company. From our perspective, if you still plan to use deduplication in your system, we recommend testing your data set with deduplication enabled and checking the performance before implementing it on the production server.