What is Near De-Duplication?
Near de-duplication is best described as a fuzzy version of standard de-duplication, which, in brief, states that if two files are identical then only one needs to be reviewed, i.e. shown in a review platform.
[The world of de-duplication is vast and full of opinion. Opinions differ on everything from how emails are de-duped, to global versus custodian de-duplication, to whether the same attachment to different emails should be brought forward in a review platform. These questions are not going to be addressed in this article.]
Near de-duplication states that if two documents are similar then they can be grouped together, and then possibly ignored or marked as relevant.
Example de-duplication: Two loose Word documents exist, one in folder A, the other in folder B. Both files are identical and have the same MD5 hash value (an algorithm used to show that files are the same). Under standard de-duplication only one of these files would need to come through to review; as they are both the same, there is no need to review both.
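The exact-duplicate check above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not any particular eDiscovery tool's implementation; the file contents are invented for the example.

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Return the MD5 hash of a file's raw bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

# Two "files" in different folders with byte-for-byte identical content
file_a = b"Quarterly penguin food order, v1.0"  # folder A
file_b = b"Quarterly penguin food order, v1.0"  # folder B

# Identical content gives an identical MD5, so only one copy
# needs to come through to review under standard de-duplication.
assert md5_of(file_a) == md5_of(file_b)
```

Note that the comparison is on raw bytes: a single changed character produces a completely different hash, which is exactly why exact de-duplication cannot group near-identical revisions together.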
Example Near De-Duplication: One file exists “File AV1.0”. This file is then opened, spell checked, and then saved as “File AV1.1”. These files are very similar and are classed as “near de-duplicates”.
Grouping Files Together
File AV1.0 and File AV1.1 are almost the same file, which means that if one is relevant to the case, the other will be too, and vice versa. If, for example, the case is about an employment dispute in which a 50-year-old manager felt he was passed over for promotion, and File AV1.0 is about penguin food, then both files are going to be non-relevant. Equally, if they are both internal memos about whether to promote the manager, then they will both be relevant.
In a company there will be many documents which are similar, e.g. multiple revisions of contracts, documents converted to PDF, etc., and these different versions can be grouped together. Once this grouping is completed a reviewer can look at one document, or a sample of documents in a group (depending on the group's size), and then mark the whole group as relevant or non-relevant as applicable; or, if the group looks interesting, they can conduct a more detailed review of each file, knowing that the documents in the group are going to be similar. Equally, if the group of documents is all about penguin food, or really any Antarctic food supply, the reviewer can happily mark the group as not relevant and move on to the next group.
Near de-duping sounds like an excellent tool: it groups together “near de-duplicates”, doing exactly what it says on the tin.
But what does “near” mean? How “near” does it need to be? This is really the big question for near de-duplication technology, and this is where it all gets a bit fuzzy. In essence, when the program is run on the files it just looks at the text, so any issues with metadata or file types can be ignored, which is a reasonable assumption to make. Near de-duplication looks at the difference between the text of the two documents: e.g. if one document has a “;” where the other has a “,”, the documents are very similar; if one uses the term “computer forensics” and the other uses “IT forensics”, then they may be only somewhat similar. The tools then tend to give a percentage match. The threshold for grouping documents together can be set, e.g. at 50% or 80%; 100% would imply an exact duplicate (of the text).
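A percentage match of this kind can be sketched with Python's standard-library `difflib.SequenceMatcher`. This is only a stand-in for whichever (typically undisclosed) similarity metric a commercial near de-duplication tool actually uses; the documents and the 80% threshold are invented for the example.

```python
from difflib import SequenceMatcher

def percent_match(text_a: str, text_b: str) -> float:
    """Return a 0-100 similarity score between two extracted texts."""
    return SequenceMatcher(None, text_a, text_b).ratio() * 100

doc1 = "The report covers computer forensics; findings attached."
doc2 = "The report covers computer forensics, findings attached."
doc3 = "The report covers IT forensics, findings attached."

THRESHOLD = 80.0  # documents scoring at or above this are grouped

# A punctuation-only change (";" vs ",") scores higher than a
# wording change ("computer forensics" vs "IT forensics"),
# and a document always scores 100 against itself.
assert percent_match(doc1, doc2) > percent_match(doc1, doc3)
assert percent_match(doc1, doc1) == 100.0
```

The design choice that matters here is the threshold: set it too high and genuine revisions land in separate groups; set it too low and unrelated documents get lumped together, which is exactly the fuzziness the paragraph above describes.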
The problem is that, while the maths is very impressive and the tools are almost certainly required to deal with the huge amounts of data, what goes on inside the engine does not appear to be transparent, and that lack of openness and independent research appears to hamper the uptake of these tools.
Would you be the first, in your jurisdiction, to go to court relying on a tool which “just works”?
Possibly, because even in 2006 the FTC was recommending its use: the use of “de-duplication” and “near-de-duplication” tools can effectively reduce production volume and costs… But the FTC also said [near de-duplication tools] can also hinder investigations. Thus, staff must be advised about its use: