Electronic Discovery: Is De-Duplication “Accurate”?

Is De-Duplication Accurate?

De-Duplication is the one thing you can rely on in electronic discovery. Everyone does it: the one-man company, the small teams, the boutiques, and the giants. It’s like the safety briefing on a plane: everybody does it, everybody knows about it.

During the pre-flight briefing you are told that, should the worst happen, the mask will pop out above your head and you should fasten it securely around your mouth before helping others. This is told to you before you get nuts and a glass of wine. This is such a well-known part of the safety briefing that if an airline did not have those ubiquitous masks you might want to get off.

But should the worst happen and you hurtle towards the earth from 30,000 feet to a certain fiery death, that plastic mask will not be of much comfort, or use.

Would you use a company that did not use de-duplication? No, of course not. De-Duplication gets rid of “identical” data. It removes all of the duplicate data, so that you are seeing only one copy of each file. De-Duplication is the nice relaxing part of every e-discovery pitch; everybody knows about it, everybody does it.

But what if the worst happens and you’re hurtling towards court and there is an “alleged” problem with some of the metadata? A computer forensics expert witness is being brought in to challenge the e-discovery information presented.

Will the pre-project briefing be of much comfort?

Electronic Discovery v Computer Forensics

The e-discovery argument is this:

All files are hashed using a highly secure algorithm called “MD5”. Different files have different hash values. In fact, even if a Word document changes by just a single full stop, the entire hash value will be different. Therefore files have to be truly identical for them to have the same hash value. For this reason we can, with 100% safety, remove all of the files with identical MD5 values, as we are removing only duplicate files.

Enter, stage left, the computer forensics man. This man tracks down internet predators, and convicts terrorists, and knows all that good stuff about deleted files.

He states, with absolute confidence, “I know more about MD5s than you do, ED vendor”, then pauses for dramatic effect.

The MD5 is a function of the data, of the actual data of the file, so the contents of the file need to be identical but the metadata does not.

The CF guy then proceeds to demonstrate his point through a series of fantastic videos [soon to be made available on YouTube].

The basic issue is this. The name of a file, the location of the file, and some of the dates of the file are NOT part of the MD5 calculation.

This means that a file called “MADE UP OIL RESERVES REPORT.TXT” created in 2001 could have an identical hash value to a file called “FACTUAL OIL RESERVES REPORT.TXT” created in 2002.
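A minimal sketch of the forensic expert's point, assuming the hash is computed over the content bytes only (which is how Python's `hashlib.md5` behaves when fed a file's data). The file contents and the dictionary layout are made up for illustration; only the two file names and years come from the example above.

```python
import hashlib


def md5_of_content(data: bytes) -> str:
    """Return the MD5 hex digest of the content bytes only.

    The file name, path and dates never enter the calculation.
    """
    return hashlib.md5(data).hexdigest()


# Two hypothetical files: different names and creation years, same content.
files = [
    {"name": "MADE UP OIL RESERVES REPORT.TXT", "created": 2001,
     "content": b"Reserves: 100 barrels."},
    {"name": "FACTUAL OIL RESERVES REPORT.TXT", "created": 2002,
     "content": b"Reserves: 100 barrels."},
]

hashes = [md5_of_content(f["content"]) for f in files]
same_hash = hashes[0] == hashes[1]

print(same_hash)  # True: the names and dates differ, the hash does not
```

A de-duplication step keyed on this hash alone would keep one of the two files and drop the other, whichever name it happened to carry.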

That’s pretty mind-blowing stuff.

If you are conducting a review into dubious claims about the oil reserves of a famous oil company, and a file with the name “MADE UP OIL RESERVES REPORT” is removed because it’s “identical” to a file called “FACTUAL OIL RESERVES REPORT”, that’s an issue, a big issue. As the dates are also different, that is going to cause problems, as files can be missed or misinterpreted because of the dates.

This is an extreme example but there are lots of scenarios that can be played out in your own mind about the issue of de-duplication and how the name, file path, or date of the document could be important to a case.

Damning stuff? De-Duplication should not be done?


99.9% of the time this scenario is not going to occur and de-duplication is going to work perfectly. [I have no idea about the statistics, I just made that bit up, but it’s going to be a very high number.]

“Aah, but what about that 0.1%?”

Previous articles have looked at the issue of low-probability errors on large data sets, and readers will know that a 0.1% error rate over a million documents will result in 1,000 files with issues falling through the net.

Even if the de-duplication issue highlighted above has only a “1 in a million” chance of occurring, in a data set of 2 million files it would be expected to happen twice. Pretty sobering stuff.
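The arithmetic behind those two figures is simply the number of documents divided by the “one in N” odds. A small sketch, where the function name is hypothetical and each document is assumed to be affected independently at the same rate:

```python
def expected_affected(total_documents: int, one_in: int) -> float:
    """Expected number of affected documents when the issue hits one in `one_in`."""
    return total_documents / one_in


# 0.1% is one in 1,000, applied to a million documents.
one_in_a_thousand = expected_affected(1_000_000, 1_000)

# One in a million, applied to two million documents.
one_in_a_million = expected_affected(2_000_000, 1_000_000)

print(one_in_a_thousand)  # 1000.0
print(one_in_a_million)   # 2.0
```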

Firstly, if de-duplication were not done, there would be millions more files to review, and the review could not be achieved financially. Therefore de-duplication has to be done; it’s not an option, like sugar with tea.

Secondly, ED companies do/should/must keep track of what has been filtered out by de-duplication, or any other method. Therefore if a critical file is found, and its dates are relevant, a more detailed investigation into that file and other identical files can easily be done.

Thirdly, ED companies approach reviews in different ways. Some put all of the files into review (duplicates and unique files), and when a file is marked as “relevant”, “hot”, “ignore”, etc., the duplicate is automatically marked too, but the duplicate is still available for review if required. Other companies don’t include the duplicate files, but provide a “place holder” instead. This allows the original duplicate files to be easily traced, again resolving the problem.
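The tracking idea can be sketched as a de-duplication pass that never throws a duplicate away silently, but logs its path and date against the file that was kept. This is an illustration under assumptions, not any vendor's method: the function name, the dictionary fields and the sample paths are all hypothetical.

```python
import hashlib
from collections import defaultdict


def deduplicate_with_log(files):
    """Keep the first file seen per content hash; log every suppressed duplicate."""
    kept = {}                       # content hash -> the file that goes to review
    suppressed = defaultdict(list)  # content hash -> metadata of removed duplicates
    for f in files:
        digest = hashlib.md5(f["content"]).hexdigest()
        if digest in kept:
            # Record the metadata the hash ignores, so it can be traced later.
            suppressed[digest].append({"path": f["path"], "modified": f["modified"]})
        else:
            kept[digest] = f
    return kept, dict(suppressed)


# Hypothetical sample data.
files = [
    {"path": "server/reports/factual.txt", "modified": "2002-03-01", "content": b"same text"},
    {"path": "laptop/drafts/made_up.txt", "modified": "2001-07-15", "content": b"same text"},
    {"path": "server/reports/other.txt", "modified": "2002-03-02", "content": b"different text"},
]

kept, suppressed = deduplicate_with_log(files)

for digest, duplicates in suppressed.items():
    print(kept[digest]["path"], "also found at", [d["path"] for d in duplicates])
```

If the kept file later turns out to be critical, the log gives the reviewer the other names, locations and dates to investigate.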

With the advent of concept searching and near de-duplication this scenario is going to become more common, not less.

As long as vendors keep track of documents, lawyers are aware that all that glistens is not gold, that not every date is accurate, and that not every duplicate document is “identical”, and there is clear communication between the vendors/consultants and legal teams, this should not pose a problem, even on the very rare occasions it does occur.


De-Duplication is good and accurate, just make sure it’s tracked.


Electronic Discovery: What is DeDuplication

What is de-duplication?

De-duplication is, as the name implies, the removal of duplicate files from a set of data. However, this is not as straightforward as the name makes it sound.

Why is De-Duplication Conducted?

Within a company’s data set there are many duplicate files, e.g. one file on a desktop, another on a file server, a third in an email. If backup tapes are used, every backup tape will potentially have a huge number of duplicates.

Example: A data set consists of one file server, which has 1,000,000 files on it, and a single backup tape containing 900,000 files. While there is a total data set of 1,900,000 files, it may be that all 900,000 files on the tape are duplicates of files among the 1,000,000, as the backup tape was taken from the server. Within the 1 million files on the server there may be 20% duplicates, i.e. there are only 800,000 unique files rather than 1.9 million, and there is no point in looking at the additional 1.1 million identical files. Without de-duplication the cost of reviewing the data would (in this theoretical example) more than double.
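The numbers in this example can be checked in a few lines. The variable names are just labels for the figures quoted above, and review cost is assumed to scale with the number of files reviewed:

```python
server_files = 1_000_000
tape_files = 900_000          # all assumed to be copies of files on the server

total_files = server_files + tape_files            # everything collected
unique_files = server_files * 80 // 100            # 20% of the server is duplicated
duplicates_removed = total_files - unique_files    # files nobody needs to review
review_cost_ratio = total_files / unique_files     # cost without vs with de-duplication

print(total_files)         # 1900000
print(unique_files)        # 800000
print(duplicates_removed)  # 1100000
print(review_cost_ratio)   # 2.375
```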



E-Files are the easiest type of file to de-duplicate. The tools involved in this process conduct a “simple” mathematical process, known as a “hash” (usually an MD5 or sometimes a SHA-1). A hash is effectively unique to a file’s contents; if the file changes at all, the hash will change. If a single full stop is added to a 700-page document, the hash value will be completely different. Therefore if two files have the same hash value, then they are the same and the duplicates can be removed, so that the investigator/reviewer does not need to see the same document twice.
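A quick way to see the “single full stop” effect for yourself, using Python's standard `hashlib`. The sample sentence is invented, and a one-line string stands in for the 700-page document:

```python
import hashlib

original = b"The quarterly report is attached"
with_full_stop = original + b"."   # the only change: one trailing full stop

md5_original = hashlib.md5(original).hexdigest()
md5_changed = hashlib.md5(with_full_stop).hexdigest()

# SHA-1, the other algorithm mentioned above, behaves the same way.
sha1_original = hashlib.sha1(original).hexdigest()
sha1_changed = hashlib.sha1(with_full_stop).hexdigest()

print(md5_original)
print(md5_changed)
print(md5_original == md5_changed)  # False
```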



Emails are not as simple to de-duplicate as one may hope for. The reasons for this include:

  • Person A in Company A sends an email to Person B in Company B. The email sent and the email received are the same email, the data has not changed, and therefore they are duplicates. However, the email in the sent items of A is a physically different file to that in the inbox of B. One of the reasons for this is that the outgoing message does not have a message header, while the incoming one does. Once the messages are taken out of their mailboxes and hashed, they will be different, as the message files are different, though the email content is the same.
  • The same is true for messages that have been copied to other people.
  • Emails are sent from an Outlook mailbox to an Outlook Express mailbox; one stores the emails as MSG files, the other as EML files. Therefore the files, while containing the same message, are clearly physically different. As a result they will not have the same hash value.

Due to the problems with de-duplication of emails, a different approach needs to be taken. One approach is to hash all of the different parts of an email (the date, author, recipients, message body, etc.) and then combine these hash values to create a new value. It is this new hash value that is used to decide whether an email is unique or not.

This way, if the emails are “the same” they can be de-duped even if they are not identical files in the computer forensics sense.
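A rough sketch of that field-by-field approach. The choice of fields follows the list above, but the function name, the dictionary layout and the normalisation (trimming whitespace, lower-casing and sorting recipients) are assumptions made for illustration; real tools each make their own choices here, which is one reason their results can differ.

```python
import hashlib

# Fields that count towards "sameness"; transport headers are deliberately left out.
DEDUP_FIELDS = ("date", "author", "recipients", "body")


def email_dedup_key(email: dict) -> str:
    """Hash each selected field, then hash the combined field hashes."""
    field_hashes = []
    for name in DEDUP_FIELDS:
        value = email[name]
        if isinstance(value, (list, tuple)):
            # Assumed normalisation: recipient order and case do not matter.
            value = ";".join(sorted(v.strip().lower() for v in value))
        else:
            value = value.strip()
        field_hashes.append(hashlib.md5(value.encode("utf-8")).hexdigest())
    return hashlib.md5("".join(field_hashes).encode("ascii")).hexdigest()


# Hypothetical sent and received copies of one message.
sent_copy = {
    "date": "2009-05-01 10:15", "author": "a@company-a.example",
    "recipients": ["b@company-b.example"], "body": "Please see the report.",
    "headers": "",
}
received_copy = dict(sent_copy, headers="Received: from mail.company-a.example ...")

print(email_dedup_key(sent_copy) == email_dedup_key(received_copy))  # True
```

Because the header block is not one of the hashed fields, the sent and received copies get the same key even though the stored files differ.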



File attachments to an email can cause debate amongst people. The aim of a review platform is to ensure that a client can review all of the data they need to without duplicating the work, but with attachments this starts to become a bit of a grey area.

If two emails are the same, they are treated as duplicates and removed. If two e-files are identical, they are de-duplicated. But what if an e-file is loose in a folder but is also attached to an email elsewhere? Are both files shown? Isn’t that a duplicate? If you remove the file attached to the email then you have broken the “family” of documents.

What if there are two emails which are different but have the same attachment? Are both those files put in for review? If they are, then work effort is duplicated, and people can mark one file as relevant and another file as non-relevant. Equally, if they are not both brought through to review, it means breaking up the family of documents.

Another option would be to treat them as separate files, but bring their hash values through to the review platform and allow the review platform to recognize that they are duplicates. This way, when one file is marked as relevant, the other, attached to a different email, will also be marked as relevant.
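As a sketch of how a review platform might do this, the class below keeps every attachment as its own document but links documents that share a hash, so a tag applied to one is copied to the others. `ReviewSet`, its methods and the document IDs are hypothetical, not the API of any real platform.

```python
from collections import defaultdict


class ReviewSet:
    """Hypothetical review store that links documents sharing a content hash."""

    def __init__(self):
        self.tags = {}                        # doc_id -> tag, or None if unreviewed
        self.hash_of = {}                     # doc_id -> content hash
        self.ids_by_hash = defaultdict(set)   # content hash -> all doc_ids with it

    def add(self, doc_id, content_hash):
        """Load a document; duplicates stay in the set as separate entries."""
        self.tags[doc_id] = None
        self.hash_of[doc_id] = content_hash
        self.ids_by_hash[content_hash].add(doc_id)

    def mark(self, doc_id, tag):
        """Tag a document and every other document with the same hash."""
        for linked_id in self.ids_by_hash[self.hash_of[doc_id]]:
            self.tags[linked_id] = tag


review = ReviewSet()
review.add("email-1/attachment", "abc123")   # same file attached to two emails
review.add("email-2/attachment", "abc123")
review.add("email-2/body", "def456")

review.mark("email-1/attachment", "relevant")
print(review.tags["email-2/attachment"])  # relevant
print(review.tags["email-2/body"])        # None
```

Note that the propagation is blind: it copies whatever tag is applied, regardless of which email family the other copy belongs to.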

But this seemingly obvious solution presents a problem: the issue of marking families. For example, if you mark an email as confidential/privileged, does this mean the rest of the family is? Probably. Therefore if you mark the attachment as privileged in one location, because of the email it’s attached to, it would also be marked as privileged where it is attached to another email, where it may not be privileged, but relevant, and should be disclosed.

These problems may seem unlikely, and there is a low probability of any one of these complex scenarios occurring. But with document sets in the millions or tens of millions, even if there is only a 1 in 10,000 chance of this occurring, there will still be 100 or 1,000 different situations where it occurs in one case, let alone multiple cases. De-duplication: not as simple as the name implies.

Electronic Discovery: What is Near Deduplication?

What is Near DeDuplication?

Near de-duplication is best described as a fuzzy version of standard de-duplication which, in brief, states that if two files are identical then only one needs to be reviewed, i.e. shown in a review platform.

[The world of de-duplication is vast and full of opinion. Opinions differ on all aspects, from how emails are de-duped, to global versus custodian de-duplication, to whether the same attachment to different emails should be brought forward in a review platform. These questions are not going to be addressed in this article.]

Near de-duplication states that if two documents are similar then they can be grouped together, and then possibly ignored or marked as relevant.

Example de-duplication: Two loose Word documents exist, one in folder A, the other in folder B. Both files are identical and have the same MD5 hash value (an algorithm used to show files are the same). Under standard de-duplication only one of these files would need to come through to review; as they are both the same, there is no need to review both.

Example Near De-Duplication: One file exists, “File AV1.0”. This file is then opened, spell checked, and saved as “File AV1.1”. These files are very similar and are classed as “near duplicates”.

Grouping Files Together

File AV1.0 and File AV1.1 are almost the same file; this means that if one is relevant to the case, the other will be, and vice versa. If, for example, the case is about an employment dispute in which a 50-year-old manager felt he was passed over for promotion, and File AV1.0 is about penguin food, they are both going to be non-relevant. Equally, if they are both internal memos about whether to promote the manager, then they will both be relevant.

In a company there will be many documents which are similar, e.g. multiple revisions of contracts, documents converted to PDF, etc., and these different versions can be grouped together. Once this grouping is completed, a reviewer can look at one document, or a sample of documents, in a group (depending on the group’s size), and then mark the whole group as relevant or non-relevant as applicable. Or, if the group looks interesting, they can conduct a more detailed review of each file, knowing that the files in the group are going to be similar. Equally, if the group of documents is all about penguin food, or really any Antarctic food supply, the reviewer can happily mark the group as not relevant and move on to the next group.


Near De-Duping sounds like an excellent tool: it groups together “near duplicates”, and it does exactly what it says on the tin.

But what does “near” mean? How “near” does it need to be? This is really the big question for near de-duplication technology, and this is where it all gets a bit fuzzy. In essence, when the program is run on the files it just looks at the text, so any issues with metadata or file types can be ignored, which is a reasonable assumption to make. Near de-duplication looks at the difference between the text of the two documents, e.g. if one document has a “;” instead of a “,” the documents are very similar; if one uses the term “computer forensics” where the other uses “IT forensics”, then they may just be similar. The tools then tend to give a percentage match. The threshold for grouping the documents together can be set, e.g. at 50% or 80%. 100% would imply an exact duplicate (for the text).
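To make the “percentage match” and threshold idea concrete, here is a small sketch using `difflib.SequenceMatcher` from Python's standard library. This is a stand-in chosen for illustration: commercial near de-duplication tools use their own, generally undisclosed, algorithms, and the function names, sample sentences and 80% default below are assumptions.

```python
from difflib import SequenceMatcher


def similarity_percent(text_a: str, text_b: str) -> float:
    """Return a 0-100 similarity score based on the text alone."""
    return 100.0 * SequenceMatcher(None, text_a, text_b).ratio()


def are_near_duplicates(text_a: str, text_b: str, threshold: float = 80.0) -> bool:
    """Group two documents together if their score meets the chosen threshold."""
    return similarity_percent(text_a, text_b) >= threshold


comma_version = "The contract is signed, today"
semicolon_version = "The contract is signed; today"

print(similarity_percent(comma_version, comma_version))       # 100.0
print(are_near_duplicates(comma_version, semicolon_version))  # True

# Swapping a term lowers the score; whether the pair is grouped depends on the threshold.
print(similarity_percent("A report on computer forensics",
                         "A report on IT forensics"))
```

Moving the threshold up or down changes which pairs end up in the same group, which is exactly why the setting needs to be understood and agreed before the tool is relied on.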

The problem is that, while the maths is very impressive, and the tools are almost certainly required to deal with the huge amounts of data, what goes on inside the engine does not appear to be transparent, and that lack of openness or independent research appears to hamper the uptake of the tools.

Would you be the first to go to court, in your jurisdiction, with a tool which “just works”?

Possibly, because even in 2006 the FTC was recommending its use: The use of “de-duplication” and “near-de-duplication” tools can effectively reduce production volume and costs… But the FTC also said [near de-duplication tools] can also hinder investigations. Thus, staff must be advised about its use: