Deduplication sounds very nice, and many storage marketing enthusiasts believe they cannot live without it. Yes, it shines in three areas: reducing storage space, saving upload bandwidth, and speeding up backups and the cloning of virtual machines. So far so good, but is there any risk in using deduplication?

The Wikipedia article on data deduplication (https://en.wikipedia.org/wiki/Data_deduplication) teaches a lot about it. Deduplication that identifies data by hash value does not by itself guarantee data integrity, so keep that in mind. A so-called collision, where two different pieces of data have the same hash value, is possible; it is almost like winning the lottery (I am exaggerating a little, as the probability of such a collision is very low, but…). The next very important point is the algorithm that is used. In general, deduplication provides real benefits if the data really is duplicated often and if deduplication works at the application level and not at the file system level. Typical applications are backups or e-mail storage, where plenty of e-mails carry identical large attachments.

Over a year ago I read a very interesting post on a newsgroup, in a thread discussing deduplication. If you have no time to read the whole discussion, here is a short quote from Josef Bacik about his experience of using deduplication with regular data:
“””
From: https://thread.gmane.org/gmane.comp.file-systems.btrfs/8448

On Wed, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
> Blah blah blah, I’m not having an argument about which is better because I
> simply do not care. I think dedup is silly to begin with, and online dedup even
> sillier. The only reason I did offline dedup was because I was just toying
> around with a simple userspace app to see exactly how much I would save if I did
> dedup on my normal system, and with 107 gigabytes in use, I’d save 300
> megabytes. I’ll say that again, with 107 gigabytes in use, I’d save 300
> megabytes. So in the normal user case dedup would have been whole useless to
> me.
“””
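If you want to repeat this kind of measurement on your own data, a small script is enough for a rough estimate. The sketch below is not Josef's tool; it is a minimal illustration that assumes fixed-size 4 KiB blocks and SHA-256 hashes (both are my assumptions, and real deduplication engines use different block sizes and algorithms). The function names `block_hashes` and `estimate_savings` are made up for this example.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size; real file systems and tools differ


def block_hashes(path, block_size=BLOCK_SIZE):
    """Yield (digest, length) for each fixed-size block of one file."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield hashlib.sha256(block).digest(), len(block)


def estimate_savings(root, block_size=BLOCK_SIZE):
    """Return (total_bytes, duplicate_bytes) for regular files under root.

    A block counts as a duplicate when a block with the same hash and
    length has already been seen anywhere in the tree.
    """
    seen = set()
    total = 0
    duplicate = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are read.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                for digest, length in block_hashes(path, block_size):
                    total += length
                    key = (digest, length)
                    if key in seen:
                        duplicate += length
                    else:
                        seen.add(key)
            except OSError:
                continue  # unreadable file: skip the rest of it
    return total, duplicate


if __name__ == "__main__":
    total, duplicate = estimate_savings(".")
    if total:
        print(f"{duplicate} of {total} bytes are duplicate blocks "
              f"({100.0 * duplicate / total:.2f}%)")
```

Run it from the directory you want to inspect. It keeps every unique hash in memory, so it is only meant for a quick check and not for scanning a large storage system. For comparison, the figures in the quote, 300 megabytes out of 107 gigabytes, come to roughly 0.3%.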
I guess it is good to know what deduplication may bring when it works at the file system level on regular data and NOT on specially prepared dedup benchmarking data. If the demonstration is run by smart, motivated marketing folks, you will most likely see how great deduplication works at the file system level. They will even prove 90% data reduction, but please be aware that in your case it can be as low as about 0.3%, as in Josef’s case (300 megabytes out of 107 gigabytes).

One more thing: inline deduplication will demonstrate very good performance when it is shown with specially prepared duplicated data on an almost empty volume. With regular data and a volume that is full of data, you will experience a huge drop in performance.
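To put the collision risk mentioned earlier into numbers, the usual birthday approximation can be used: with n blocks and a b-bit hash, the probability of at least one collision is about 1 - exp(-n(n-1) / 2^(b+1)). The sketch below is only an illustration; it assumes an ideal, uniformly distributed hash, and the function name `collision_probability` is made up for this example.

```python
import math


def collision_probability(num_blocks, hash_bits):
    """Birthday approximation for at least one collision among num_blocks
    values of an ideal hash_bits-bit hash."""
    exponent = -num_blocks * (num_blocks - 1) / (2.0 * 2.0 ** hash_bits)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is tiny.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # 2**38 blocks of 4 KiB is one pebibyte of data (assumed example size).
    for bits in (64, 128, 256):
        print(bits, collision_probability(2 ** 38, bits))
```

With a short hash the risk becomes real quickly, while with a 256-bit hash it is far below the lottery odds. Whether that is acceptable also depends on whether the product compares the actual data after a hash match, which is something to ask the vendor.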