{"id":16513,"date":"2020-05-09T16:59:57","date_gmt":"2020-05-09T14:59:57","guid":{"rendered":"http:\/\/blog.open-e.com\/?p=16513"},"modified":"2025-04-07T10:38:21","modified_gmt":"2025-04-07T10:38:21","slug":"to-deduplicate-or-not-to-deduplicate","status":"publish","type":"post","link":"https:\/\/www.open-e.com\/blog\/to-deduplicate-or-not-to-deduplicate\/","title":{"rendered":"To Deduplicate or Not to Deduplicate?"},"content":{"rendered":"<p>\t\t\t\t<strong><em>Updated 19\/11\/2021<\/em><\/strong><\/p>\n<h2><span style=\"font-weight: 400;\">How Deduplication Works<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Deduplication, a feature we have in <\/span><a href=\"https:\/\/www.open-e.com\/products\/jovian-data-storage-software\/general-information\/\"><span style=\"font-weight: 400;\">Open-E JovianDSS<\/span><\/a><span style=\"font-weight: 400;\">, sounds very useful, and many storage marketing enthusiasts believe they cannot live without it. Admittedly, it does seem very attractive, largely due to three key benefits that it provides. First, it lets you reduce the space you use. Second, it allows you to save on the upload bandwidth. Lastly, it improves the performance of your backups and cloned virtual machines.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Everything is good so far, but is there any risk to using deduplication? Well, we can actually learn a lot about it from the Wikipedia article (<\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_deduplication\"><span style=\"font-weight: 400;\">https:\/\/en.wikipedia.org\/wiki\/Data_deduplication<\/span><\/a><span style=\"font-weight: 400;\">). First of all, data deduplication does not guarantee data integrity, so keep that in mind. Second, the probability of a collision occurring is very low. 
A collision occurs when two different pieces of data produce the same hash value.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The last significant point is that it all depends on the algorithm used and on the circumstances. In general, deduplication can provide real benefits if the data contains a lot of duplicates and if deduplication works at the application level rather than at the file system level. Applications that benefit greatly from deduplication typically include backups and e-mail storage where many e-mails carry large, identical attachments. In other circumstances, most of the benefit of deduplication is lost.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Community Opinions on Deduplication<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">A while ago<\/span><span style=\"font-weight: 400;\">, I was reading a fascinating post on a newsgroup. There was an interesting discussion there about deduplication. Below is a short quote from Josef Bacik, a participant in the aforementioned discussion, about his experiences using deduplication with regular data:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">\u201c\u201d\u201d<\/span><\/p>\n<p><span style=\"font-weight: 400;\">From: https:\/\/thread.gmane.org\/gmane.comp.file-systems.btrfs\/8448<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&gt; On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&gt; &gt; Blah blah blah, I\u2019m not having an argument about which is better because I<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&gt; &gt; simply do not care.\u00a0 I think dedup is silly to begin with, and online dedup even<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&gt; &gt; sillier.\u00a0 The only reason I did offline dedup was because I was just toying<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&gt; &gt; around with a simple userspace app to see exactly how much I would save if I 
did<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&gt; &gt; dedup on my normal system, and with 107 gigabytes in use, I\u2019d save 300<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&gt; &gt; megabytes.\u00a0 I\u2019ll say that again, with 107 gigabytes in use, I\u2019d save 300<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&gt; &gt; megabytes.\u00a0 So in the normal user case dedup would have been whole useless to<\/span><\/p>\n<p><span style=\"font-weight: 400;\">&gt; &gt; me.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">\u201c\u201d\u201d<\/span><\/p>\n<p><span style=\"font-weight: 400;\">It\u2019s always good to know how deduplication performs at the file system level with regular data, NOT specially prepared dedup benchmarking data.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Closing Thoughts<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">You\u2019ll most likely get a great demonstration of how well deduplication works at the file system level if it\u2019s done by intelligent, motivated marketing folks. They\u2019ll even prove that a 90% data reduction is possible, but please be aware that that\u2019s not always the case. It could very well be much lower, like the roughly 0.28% (300 megabytes out of 107 gigabytes) in Josef\u2019s case.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">One more thing: inline deduplication will demonstrate outstanding performance with specially prepared duplicated data on an almost empty volume. In real use cases involving regular data and volumes full of data, you\u2019ll almost always experience a huge drop in performance.<\/span>\t\t<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Updated 19\/11\/2021 How Deduplication Works Deduplication, a feature we have in Open-E JovianDSS, sounds very useful and many storage marketing enthusiasts believe they cannot live without it. 
Admittedly, it does&nbsp;&#8230;<\/p>\n","protected":false},"author":2,"featured_media":45773,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[27,796],"tags":[212],"class_list":["post-16513","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-open-e-joviandss","category-zfs-data-storage","tag-deduplication"],"acf":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/posts\/16513","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/comments?post=16513"}],"version-history":[{"count":1,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/posts\/16513\/revisions"}],"predecessor-version":[{"id":55260,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/posts\/16513\/revisions\/55260"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/media\/45773"}],"wp:attachment":[{"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/media?parent=16513"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/categories?post=16513"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.open-e.com\/blog\/wp-json\/wp\/v2\/tags?post=16513"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}