Deduplication, a feature available in Open-E JovianDSS, sounds very useful, and many storage marketing enthusiasts believe they cannot live without it. Admittedly, it does seem attractive, largely due to three key benefits. First, it reduces the space you use. Second, it saves upload bandwidth. Lastly, it improves the performance of backups and cloned virtual machines.
Everything looks good so far, but is there any risk in using deduplication? We can actually learn a lot from the Wikipedia article on the subject (https://en.wikipedia.org/wiki/Data_deduplication). First of all, data deduplication does not guarantee data integrity, so keep that in mind. Second, there is a very low but non-zero probability of a collision, which occurs when two different pieces of data produce the same hash value.
The last significant point is that the outcome depends on the algorithm used and on the circumstances. In general, deduplication provides real benefits when the data is frequently duplicated and when deduplication works at the application level rather than the file-system level. Typical applications that benefit greatly include backups and e-mail storage in cases where many messages carry large, identical attachments. In other circumstances, much of the benefit of deduplication is lost.
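To make the mechanism concrete, here is a minimal, purely illustrative sketch of hash-based block deduplication (not how JovianDSS implements it): data is split into fixed-size blocks, each block is hashed, and only blocks with previously unseen hashes are stored. The 4 KiB block size is an assumption for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for illustration

def dedup_store(data: bytes):
    """Return (unique_blocks, block_refs) for the given byte stream."""
    store = {}  # hash -> block payload (the unique-block store)
    refs = []   # ordered list of hashes that reconstructs the stream
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        # NOTE: trusting the hash alone assumes collisions never happen;
        # a byte-by-byte verify step would guarantee integrity, at a cost.
        store.setdefault(digest, block)
        refs.append(digest)
    return store, refs

# Highly duplicated data dedups extremely well...
data = b"A" * 4096 * 100  # 100 identical blocks
store, refs = dedup_store(data)
print(len(store), len(refs))  # 1 unique block backing 100 references
```

With realistic, mostly unique data, `store` grows nearly as fast as `refs`, which is exactly why the savings on regular data can be so underwhelming.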
Community Opinions on Deduplication
A while ago, I came across a fascinating discussion about deduplication on a newsgroup. Below is a short quote from Josef Bacik, one of the participants, about his experience using deduplication on regular data:
> On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
> > Blah blah blah, I’m not having an argument about which is better because I
> > simply do not care. I think dedup is silly to begin with, and online dedup even
> > sillier. The only reason I did offline dedup was because I was just toying
> > around with a simple userspace app to see exactly how much I would save if I did
> > dedup on my normal system, and with 107 gigabytes in use, I’d save 300
> > megabytes. I’ll say that again, with 107 gigabytes in use, I’d save 300
> > megabytes. So in the normal user case dedup would have been whole useless to
> > me.
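The "simple userspace app" Josef describes can be approximated with a short script (a hypothetical sketch, not his actual tool): walk a directory tree, hash fixed-size blocks, and report how many bytes block-level dedup would reclaim. The 128 KiB block size is an assumption.

```python
import hashlib
import os

BLOCK_SIZE = 131072  # 128 KiB, an assumed block size

def estimate_savings(root: str):
    """Return (total_bytes, duplicated_bytes) under the tree at root."""
    seen = set()
    total = duplicated = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while True:
                        block = f.read(BLOCK_SIZE)
                        if not block:
                            break
                        total += len(block)
                        digest = hashlib.sha256(block).digest()
                        if digest in seen:
                            duplicated += len(block)  # dedup would reclaim this
                        else:
                            seen.add(digest)
            except OSError:
                continue  # skip unreadable files
    return total, duplicated
```

Running something like `estimate_savings("/home")` on a typical system is an easy way to check, before enabling the feature, whether your data looks more like Josef's 300 MB out of 107 GB or like a marketing benchmark.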
It’s always good to know how deduplication performs at the file-system level on regular data, NOT on specially prepared dedup benchmarking data.
You’ll most likely get a great demonstration of how well file-system-level deduplication works if it’s done by intelligent, motivated marketing folks. They may even prove that a 90% data reduction is possible, but please be aware that that’s not always the case. The figure can be far lower, like the roughly 0.3% (300 MB out of 107 GB) in Josef’s case.
One more thing: inline deduplication will demonstrate outstanding performance with specially prepared duplicated data on an almost empty volume. In real use cases, with regular data and volumes full of data, you’ll almost always experience a huge drop in performance.