Figure: duplicated files on the left and the deduplicated files on the right, after using the Open-E JovianDSS system

    Deduplication – The Global Compression

    Modern society runs on information, and people have become dependent on it. What’s more, we keep deepening this process by generating more and more data. As a result, the ever-increasing need for storage capacity has become one of the biggest challenges in today’s IT industry. Backups double this need – and with it the cost of maintaining IT infrastructure – so the data explosion is a problem that requires immediate action. What can be the “antidote” to this?

    Deduplication – the Compression Alternative

    Looking at the question above, we can point at two ways to solve the problem: we may try to increase the capacity of storage devices (on the hardware level), or find a way to organize data storage so that capacity is consumed more efficiently (on the software level). Let’s think about the second idea – in this case, optimization through data deduplication, a process that makes backup storage “thrifty” with space. How should we understand it? What makes deduplication a better tool than hardware data compression, incremental backup techniques, or differential backups? Let us take a closer look.

    In short, deduplication is a process that eliminates duplicate data and replaces it with links pointing to a single copy of the original data. Easy, isn’t it? It is, without a doubt. What is more interesting, however, is that while deduplication is primarily aimed at mass storage, it can also be used in database systems and other applications.

    How Does It Work?

    Theoretically, the process of deduplication is quite simple. It is based on systematically searching for repeated data blocks, eliminating them, and replacing them with references to a single remaining copy of the data in the system. The process can take place at the file system level or at the level of disk blocks; the latter yields better results, since it is independent of the type and number of files in the file system and of the operating system the storage runs on. We will return to this issue later.
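    The block-level mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular product’s implementation: blocks are identified by a hash, each unique block is stored once, and duplicates become references into that store.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, not taken from the article

def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks, keep one copy of each unique
    block, and record per-block references into that single-copy store."""
    store = {}   # hash -> the one stored copy of the block
    refs = []    # ordered references that describe the original data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block   # first occurrence: store the block
        refs.append(digest)         # duplicates become mere references
    return store, refs

def reassemble(store, refs):
    """Rebuild the original data by following the references."""
    return b"".join(store[digest] for digest in refs)
```

    For input containing repeated blocks, `store` holds far less data than the original, while `refs` preserves enough information to reconstruct it exactly.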

    The process’s complexity increases if we consider the “smart” side of this solution. Deduplication can also rely on finding matching content despite record differences, errors, and typos – it is not always necessary to find exact duplicates. Such a solution is possible thanks to an algorithm that evaluates the similarity between data blocks. When the search is finished, records are assigned to one of three groups: identical, similar, or different. That is how it works – in general.
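    The three-way grouping can be illustrated with a toy classifier. The article does not name the similarity algorithm, so the use of Python’s `difflib.SequenceMatcher` and the 0.9 threshold below are assumptions for illustration only:

```python
from difflib import SequenceMatcher

SIMILAR_THRESHOLD = 0.9  # hypothetical cut-off, not from the article

def classify(record: bytes, reference: bytes) -> str:
    """Assign a record to one of the three groups the article mentions
    (identical, similar, different) relative to a reference record."""
    if record == reference:
        return "identical"
    # Similarity score in [0, 1]; a stand-in for a real block-similarity
    # algorithm used by deduplication engines.
    score = SequenceMatcher(None, record, reference).ratio()
    return "similar" if score >= SIMILAR_THRESHOLD else "different"
```

    A record with a single-character typo relative to the reference would score just above the threshold and land in the “similar” group, while unrelated data falls into “different”.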

    Deduplication vs. Compression – the Difference

    There is one main difference between the two methods. Compression takes place at the file level, so it does not matter whether we compress many files or only one. Deduplication, in contrast, applies across all the files it covers: repeated data blocks are searched for globally, not within a single file. For data sets with redundancy across files, such a solution is considerably more effective than compression.
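    The contrast can be shown with a small sketch: two files with identical, poorly compressible content. Per-file compression pays for the content twice, while global deduplication stores it once. The filenames and payload are hypothetical, and file-level deduplication is used here only to keep the comparison short:

```python
import hashlib
import zlib

# ~3.2 KB of pseudo-random (hence incompressible) payload, duplicated
# across two hypothetical files.
payload = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(100))
files = {
    "report_v1.doc": payload,
    "report_v2.doc": payload,  # an identical copy
}

raw_total = sum(len(data) for data in files.values())

# Per-file compression: each file is processed independently, so the
# duplicated content is still stored twice.
compressed_total = sum(len(zlib.compress(data)) for data in files.values())

# Global deduplication: identical content across files is stored once.
store = {hashlib.sha256(data).hexdigest(): data for data in files.values()}
dedup_total = sum(len(data) for data in store.values())
```

    Here compression barely helps (the payload has no internal redundancy), while deduplication halves the footprint because the redundancy is *between* files – exactly the case the paragraph above describes.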

    Models

    Deduplication is a method, an idea. But if we decide on such a solution, we also need to choose between implementations (we can also call them “projects”) – for example Opendedup, LessFS, BitWackr (Exar), etc. Which one we pick depends on the characteristics of the individual solutions – it is hard to point at a single correct choice. The optimal way to find the best fit is to weigh their different properties.

    Some of the existing solutions are based on a model that combines deduplication and compression. The range of these solutions, and their quality, is extensive; it may depend on the adopted technology, programming language, etc. Their performance may follow from their deduplication efficiency, and read and write speeds can also differ depending on the deduplication model. As is easy to see, the choice of a proper solution should be a compromise between the technological capabilities of the implementation and the performance requirements.

    Another consideration is the level at which deduplication operates. We can highlight:

    • file-level deduplication – the least effective model, but also the simplest and the one that requires the least effort;
    • variable block-level deduplication – data blocks do not have a fixed size but are adjusted to capture the best possible data string;
    • fixed block-level deduplication – here the unit of data (for which a fingerprint is generated and checked for uniqueness) is a block of fixed size. The shorter the block, the greater the number of copies that can be found and the greater the benefit in recovered space;
    • byte-level deduplication – data are compared byte by byte; this approach is used with similar file types (content-aware deduplication), such as .doc, .png, etc.
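    The variable block-level idea from the list above can be sketched with a toy content-defined chunker: instead of cutting at fixed offsets, it cuts wherever a rolling checksum of the recent bytes hits a chosen pattern, so boundaries follow the content. The window size and mask are hypothetical; real implementations use Rabin fingerprints or similar rolling hashes rather than this naive sum:

```python
def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0x3F):
    """Toy content-defined chunking: cut a new block whenever a rolling
    checksum over the last `window` bytes matches the mask pattern."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        # Naive rolling value over the trailing window; real systems
        # use a true rolling hash updated in O(1) per byte.
        value = sum(data[i - window:i]) & 0xFFFF
        if value & mask == mask:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # final (possibly short) block
    return chunks
```

    The payoff over fixed blocks: inserting one byte near the start of a file shifts every fixed-size block and defeats duplicate detection, whereas content-defined boundaries re-synchronize after the insertion point, so most blocks still match their stored copies.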

    It is easy to see that all of the mentioned solutions have their own benefits and drawbacks. That is why every implementation must be adapted to the specific application – including any need for functionalities that are unique to each model. It is also important to consider the purpose of the specific application, as not every model works with every piece of software – e.g. Hyper-V. So, as noted above, the choice of an appropriate solution must follow from needs we are aware of.
