{"id":52328,"date":"2022-12-09T10:27:03","date_gmt":"2022-12-09T09:27:03","guid":{"rendered":"https:\/\/www.open-e.com\/blog\/?p=52328"},"modified":"2025-06-18T08:21:49","modified_gmt":"2025-06-18T08:21:49","slug":"data-deduplication-in-zfs-yes-or-not","status":"publish","type":"post","link":"https:\/\/www.open-e.com\/blog\/data-deduplication-in-zfs-yes-or-not\/","title":{"rendered":"Data Deduplication in ZFS \u2013 Yes or Not?"},"content":{"rendered":"<p>\t\t\t\t<span style=\"font-weight: 400;\">When you scroll through data storage articles and read file system specifications, you might come across the catchy term \u201c<\/span><b>data deduplication<\/b><span style=\"font-weight: 400;\">\u201d. In case you haven\u2019t looked into it so far: it is a <a href=\"https:\/\/www.open-e.com\/blog\/zfs-summary\/\">file system<\/a> feature that <\/span><b>allows for reducing <\/b><span style=\"font-weight: 400;\">the amount of stored data. It can work on two levels: <\/span><b>file- and block-level <\/b><span style=\"font-weight: 400;\">(in some specific cases, it can be <\/span><a href=\"https:\/\/www.mckusick.com\/bookrefs\/zfs_dedup.html\"><span style=\"font-weight: 400;\">bit-level<\/span><\/a><span style=\"font-weight: 400;\"> as well), which, as you can guess, depends on the type of deduplication. File-level deduplication works with whole files, while block-level deduplication works with blocks of data. Both have their advantages: file-level deduplication takes fewer resources and thus may be deployed over larger amounts of physical storage, while block-level deduplication can eliminate duplicate chunks of data smaller than a file.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Sounds good so far, right?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, before happily implementing the feature just because it sounds desirable, you should first analyze the data you store. 
Let us explain how it works and what it actually offers you, using Open-E JovianDSS as an example.\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Pros of Deduplication<\/span><\/h2>\n<h4>Used Space Reduction:<\/h4>\n<p><span style=\"font-weight: 400;\">As stated in the description of the term, data deduplication is a technique for eliminating repeated data. So, the main goal of deduplication is to reduce the space occupied in data storage. How does it work? During this process, the system splits the data into chunks, which are contiguous blocks of data, and compares the byte pattern of each chunk with the chunks within existing data to find matches. When the system claims \u201cIt\u2019s a match!\u201d, the redundant chunk is replaced with a small reference that points to the \u201ctwin\u201d. Also,<\/span><b> it can effectively reduce the used capacity, and even the number of disks required<\/b><span style=\"font-weight: 400;\">, as long as the <\/span><a href=\"https:\/\/forum.huawei.com\/enterprise\/en\/why-does-deduplication-and-compression-affect-performance\/thread\/564049-891\"><span style=\"font-weight: 400;\">deduplication ratio is higher than 1:1<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h4>Improved Performance:<\/h4>\n<p><span style=\"font-weight: 400;\">Because duplicates are eliminated, there is a much greater chance of improving I\/O performance for write operations.<\/span><b> The more data is deduplicated, the less data there is left to be written<\/b><span style=\"font-weight: 400;\">, as the system records just references to the previously stored data.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Cons of Deduplication<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">As we\u2019ve mentioned already, you should analyze your business requirements and the data you\u2019d like to store on your data storage solution.<\/span><b> If the data cannot be deduplicated (which means that the 
data chunks are unique and cannot be reduced), this feature can turn out to be useless<\/b><span style=\"font-weight: 400;\">. It will noticeably affect neither the capacity nor the performance.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In addition, in some cases deduplication affects the read performance of your system. For example, <\/span><b>a restore operation from a deduplicated storage pool can be slower than a restore from a non-deduplicated pool<\/b><span style=\"font-weight: 400;\">. This is because deduplication can spread the extents of a given file across multiple volumes on the server.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">When it comes to automatic data comparison on all deduplication levels, the system compares checksums computed from the binary data to decide whether the bits\/files\/blocks are the same. <\/span><b>This way, if two different pieces of data are judged identical, which cannot be ruled out entirely, it may lead to data loss after deduplication<\/b><span style=\"font-weight: 400;\">. The smaller the deduplication level, the smaller the risk of data loss.\u00a0\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Disabling Deduplication<\/span><\/h2>\n<p><b>IMPORTANT: If you decide to turn deduplication off, for instance in order to restore the performance level, the change will apply only to new data<\/b><span style=\"font-weight: 400;\">. The old files will remain unchanged since disabling the deduplication process doesn\u2019t reduplicate the data. The deduplicated files remain deduplicated, so disabling it won\u2019t improve the storage performance for them. 
To reduplicate the existing data, you can use the <\/span><i><span style=\"font-weight: 400;\">ZFS send<\/span><\/i><span style=\"font-weight: 400;\"> and <\/span><i><span style=\"font-weight: 400;\">ZFS receive<\/span><\/i><span style=\"font-weight: 400;\"> features to send the data to a dataset that doesn\u2019t use deduplication.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Types of data suitable for deduplication:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong>File servers with a large number of copies of the same data<\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong>Virtualized infrastructure with plenty of generic virtual machines<\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong>Email servers (the same attachments, repeatable parts of the messages, like email footers)<\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong>Media files with numerous versions of the same projects<\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong>Archives<\/strong><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Types of data not recommended for deduplication:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong>Already deduplicated data<\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong>Compressed data<\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong>Highly non-repeatable data, where every file or piece of data is unique<\/strong><\/li>\n<\/ul>\n<h2><span style=\"font-weight: 400;\">Open-E Feedback<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Before enabling the <\/span><a href=\"https:\/\/www.open-e.com\/products\/jovian-data-storage-software\/general-information\/\"><span style=\"font-weight: 400;\">Open-E JovianDSS<\/span><\/a><span style=\"font-weight: 400;\"> deduplication feature, you should judge for yourself whether it suits the data used in your company. 
From our perspective, if you still plan to utilize deduplication in your system, we recommend testing your data set with the deduplication capabilities, as well as checking the performance before implementing it on the production server.\u00a0<\/span>\t\t<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When you scroll through the different data storage articles and read about the file systems\u2019 specifications, you might come across such a catchy term as \u201cdata deduplication\u201d. In case you&nbsp;&#8230;<\/p>\n","protected":false},"author":2,"featured_media":55686,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[27],"tags":[186,212,263],"class_list":["post-52328","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-open-e-joviandss","tag-data-deduplication","tag-deduplication","tag-file-system"],"acf":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/posts\/52328","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/comments?post=52328"}],"version-history":[{"count":1,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/posts\/52328\/revisions"}],"predecessor-version":[{"id":55687,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/posts\/52328\/revisions\/55687"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/media\/55686"}],"wp:attachment":[{"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/media?parent=52328"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":
"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/categories?post=52328"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/tags?post=52328"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}