Is De-Duplication Accurate?
De-Duplication is the one thing you can rely on in electronic discovery. Everyone does it: the one-man company, the small teams, the boutiques, and the giants. It’s like the safety briefing on a plane; everybody does it, everybody knows about it.
During the pre-flight briefing you are told that, should the worst happen, the mask will pop out above your head and you should fasten it securely around your mouth before helping others. This is told to you before you get nuts and a glass of wine. It is such a well-known part of the safety briefing that if an airline did not have those ubiquitous masks you might want to get off.
But should the worst happen, and you hurtle towards the earth from 30,000 feet for a certain fiery death, that plastic mask will not be of much comfort, or use.
Would you use a company that did not use de-duplication? No, of course not. De-Duplication gets rid of “identical” data. It removes all of the duplicate data, so that you are seeing only one copy of each file. De-Duplication is the nice relaxing part of every e-discovery pitch; everybody knows about it, everybody does it.
But what if the worst happens and you’re hurtling towards court, and there is an “alleged” problem with some of the metadata? A computer forensics expert witness is being brought in to challenge the e-discovery information presented.
Will the pre-project briefing be of much comfort?
Electronic Discovery v Computer Forensics
The e-discovery argument is this:
All files are hashed using a highly secure algorithm called “MD5”. Different files have different hash values; in fact, even if a Word document changes by just a single full stop, the entire hash value will be different. Therefore files have to be truly identical for them to have the same hash value. For this reason we can, with 100% safety, remove all of the files with identical MD5 values, as we are removing only duplicate files.
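The single-full-stop claim is easy to check with Python’s standard `hashlib`. A quick illustrative sketch (the document text is made up for the example, and this is not any vendor’s actual pipeline):

```python
import hashlib

# Two "documents" that differ by a single full stop.
doc_a = b"The oil reserves are fully verified"
doc_b = b"The oil reserves are fully verified."  # one extra character

hash_a = hashlib.md5(doc_a).hexdigest()
hash_b = hashlib.md5(doc_b).hexdigest()

# A one-character change produces a completely different digest.
print(hash_a == hash_b)  # False
```

Because MD5 is computed over every byte of the content, any change at all, however small, yields a different value.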
Enter, stage left, the computer forensics man. This man tracks down internet predators, and convicts terrorists, and knows all that good stuff about deleted files.
He states, with absolute confidence, “I know more about MD5s than you do, ED vendor”, then pauses for dramatic effect.
“The MD5 is a function of the data, of the actual data of the file. So the contents of the file need to be identical, but the metadata does not.”
The CF guy then proceeds to demonstrate his point through a series of fantastic videos [soon to be made available on YouTube].
The basic issue is this: the name of a file, the location of the file, and some of the dates of the file are NOT part of the MD5 calculation.
This means that a file called “MADE UP OIL RESERVES REPORT.TXT” created in 2001 could have an identical hash value to a file called “FACTUAL OIL RESERVES REPORT.TXT” created in 2002.
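A minimal sketch of the forensics point: two files with different names (the names here are the made-up ones from the example) holding identical content produce identical MD5 digests, because name, path, and dates play no part in the calculation:

```python
import hashlib
import os
import tempfile

# Identical content written under two completely different file names.
content = b"Oil reserves: 10 billion barrels"

tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "MADE UP OIL RESERVES REPORT.TXT")
path_b = os.path.join(tmpdir, "FACTUAL OIL RESERVES REPORT.TXT")

for path in (path_a, path_b):
    with open(path, "wb") as f:
        f.write(content)

def md5_of(path):
    """MD5 of the file's bytes only -- no name, path, or date involved."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# The digests match, so naive de-duplication would drop one of them.
print(md5_of(path_a) == md5_of(path_b))  # True
```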
That’s pretty mind-blowing stuff.
If you are conducting a review into dubious claims of oil reserves at a famous oil company, and a file with the name “MADE UP OIL RESERVES REPORT” is removed because it is “identical” to a file called “FACTUAL OIL RESERVES REPORT”, that’s an issue, a big issue. As the dates are also different, that is going to cause problems, as files can be missed or misinterpreted due to the dates.
This is an extreme example but there are lots of scenarios that can be played out in your own mind about the issue of de-duplication and how the name, file path, or date of the document could be important to a case.
Damning stuff? Should de-duplication not be done?
99.9% of the time this scenario is not going to occur and de-duplication is going to work perfectly. [I have no idea about the statistics, I just made that bit up, but it’s going to be a very high number]
“Aah, but what about that 0.1%”.
Previous articles have looked at the issue of low-probability errors on large data sets, and readers will know that a 0.1% error rate over a million documents results in 1,000 files with issues that would fall through the net.
Even if the de-duplication issue highlighted above has only a “1 in a million chance of occurring”, in a data set of 2 million files it would be expected to happen twice. Pretty sobering stuff.
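The expected-count arithmetic behind that statement is simply probability per file times the number of files:

```python
# Expected number of occurrences = probability per file * number of files
p = 1 / 1_000_000   # "1 in a million chance of occurring"
n = 2_000_000       # files in the data set

expected = p * n
print(expected)  # 2.0 -- expected to happen twice
```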
Firstly, if de-duplication were not done, there would be millions more files to review, and the review could not be achieved financially. Therefore de-duplication has to be done; it’s not an option like sugar with tea.
Secondly, ED companies do/should/must keep track of what has been filtered out by de-duplication, or any other method. Therefore if a critical file is found and the dates are relevant, a more detailed investigation into that file and other identical files can easily be done.
Thirdly, some ED companies approach reviews in different ways. Some put all of the files into review (duplicates and unique files), and when a file is marked as “relevant”, “hot”, “ignore”, etc., the duplicate is automatically marked, but the duplicate is still available for review if required. Other companies don’t include the duplicate files at all, but provide a “place holder” instead. This allows the original duplicate files to be easily traced, again resolving the problem.
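A hypothetical sketch of that “keep track” approach: de-duplicate on MD5, but record the name of every filtered copy so an “identical” file can always be traced back later. The document names and structure here are illustrative only, not any real product’s data model:

```python
import hashlib

# Toy document set: two copies with different names, one unique file.
documents = [
    {"name": "FACTUAL OIL RESERVES REPORT.TXT", "content": b"reserves data"},
    {"name": "MADE UP OIL RESERVES REPORT.TXT", "content": b"reserves data"},
    {"name": "MEMO.TXT", "content": b"unrelated memo"},
]

review_set = {}   # md5 digest -> the one copy that goes to review
duplicates = {}   # md5 digest -> names of the copies filtered out (place holders)

for doc in documents:
    digest = hashlib.md5(doc["content"]).hexdigest()
    if digest in review_set:
        # Filtered out, but its name survives as a traceable record.
        duplicates.setdefault(digest, []).append(doc["name"])
    else:
        review_set[digest] = doc

print(len(review_set))  # 2 unique documents go to review
print(duplicates)       # the filtered copies remain traceable
```

If the reviewed copy later turns out to be critical, the `duplicates` record shows exactly which other names (and, in a fuller version, paths and dates) were suppressed.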
With the advent of concept searching and near de-duplication this scenario is going to become more common, not less.
As long as vendors keep track of documents, lawyers are aware that all that glistens is not gold, that every date is not necessarily accurate, and that every duplicate document may not be “identical”, and there is clear communication between the vendors/consultants and legal teams, this should not pose a problem, even on the very rare occasions it does occur.
De-Duplication is good and accurate; just make sure it’s tracked.