What is de-duplication?
De-duplication is, as the name implies, the removal of duplicate files from a set of data. However, despite the name, the process is not as straightforward as it sounds.
Why is De-Duplication Conducted?
Within a company’s data set there are many duplicate files, e.g. one copy of a file on a desktop, another on a file server, a third attached to an email. If backup tapes are used, every backup tape will potentially hold a huge number of duplicates.
Example: A data set consists of one file server holding 1,000,000 files and a single backup tape containing 900,000 files. While the total data set is 1,900,000 files, it may be that the 900,000 files on the tape are duplicates of files on the server, as the backup tape was taken from the server. Within the 1 million files on the server there may be 20% duplicates, i.e. there are only 800,000 unique files rather than 1.9 million, and there is no point in reviewing the remaining 1.1 million identical files. Without de-duplication the cost of reviewing the data would (in this theoretical example) more than double.
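The arithmetic in this example can be sketched as follows (the figures are the hypothetical ones above, not real case data):

```python
# Hypothetical figures from the example above.
server_files = 1_000_000
tape_files = 900_000              # all copied from the server, so all duplicates
duplicate_rate_on_server = 0.20   # 20% of the server files duplicate each other

total_collected = server_files + tape_files
unique_files = int(server_files * (1 - duplicate_rate_on_server))
removed = total_collected - unique_files

print(total_collected)  # files collected
print(unique_files)     # files actually worth reviewing
print(removed)          # duplicates stripped out before review
```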
E-files are the easiest type of file to de-duplicate. The tools involved in this process perform a “simple” mathematical operation on each file, known as a “hash” (usually an MD5, or sometimes a SHA-1). A hash is effectively unique to a file: if the file changes at all, the hash will change. If a single full stop is added to a 700-page document, the hash value will be completely different. Therefore, if two files have the same hash value they are the same file, and the duplicates can be removed so that the investigator/reviewer does not need to see the same document twice.
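A minimal sketch of this in Python, using MD5 as mentioned above (the byte string stands in for a document’s contents):

```python
import hashlib

def file_hash(data: bytes) -> str:
    """Return the MD5 digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()

# A stand-in for the contents of a long document.
doc = b"Lorem ipsum dolor sit amet. " * 1000

# Identical content always yields an identical hash...
assert file_hash(doc) == file_hash(doc)

# ...but adding a single full stop produces a completely different hash.
assert file_hash(doc) != file_hash(doc + b".")
```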
Emails are not as simple to de-duplicate as one might hope. The reasons for this include:
- Person A in Company A sends an email to Person B in Company B. The email sent and the email received are the same email; the content has not changed, and therefore they are duplicates. However, the email in the sent items of A is a physically different file from the one in the inbox of B. One of the reasons for this is that the outgoing message does not have a message header, while the incoming one does. Once the messages are taken out of their mailboxes and hashed, they will produce different hash values, because the message files are different even though the email content is the same.
- The same is true for messages that have been copied to other people.
- Emails sent from an Outlook mailbox to an Outlook Express mailbox are stored as an MSG file at one end and an EML file at the other. The files, while containing the same message, are clearly physically different, and as a result they will not have the same hash value.
Due to these problems, a different approach needs to be taken to the de-duplication of emails. One approach is to hash the different parts of an email separately (the date, author, recipients, message body, etc.) and then combine these hash values to create a new value. It is this new, composite hash value that is used to measure whether an email is unique or not.
This way, if the emails are “the same” they can be de-duplicated even though they are not identical files in the computer forensics sense.
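The composite-hash approach described above can be sketched as follows. This is an illustrative simplification, not any specific vendor’s implementation; the choice of fields and of MD5 is an assumption:

```python
import hashlib

def email_hash(sender: str, recipients: list, date: str, subject: str, body: str) -> str:
    """Hash each logical part of the email, then hash the combined digests."""
    parts = [sender, ",".join(sorted(recipients)), date, subject, body]
    digests = [hashlib.md5(p.encode("utf-8")).hexdigest() for p in parts]
    return hashlib.md5("".join(digests).encode("utf-8")).hexdigest()

# The sent copy (an MSG file) and the received copy (an EML file) are
# physically different files, but their logical parts are identical,
# so their composite hashes match and they de-duplicate against each other.
sent = email_hash("a@companya.com", ["b@companyb.com"],
                  "2010-01-05", "Re: deal", "Please see attached.")
received = email_hash("a@companya.com", ["b@companyb.com"],
                      "2010-01-05", "Re: deal", "Please see attached.")
assert sent == received
```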
File attachments to an email can cause debate. The aim of a review platform is to ensure that a client can review all of the data they need to without duplicating the work, but with attachments this starts to become a grey area.
If two emails are the same, they are treated as duplicates and removed. If two e-files are identical, they are de-duplicated. But what if an e-file sits loose in a folder and is also attached to an email elsewhere? Are both files shown, and isn’t that a duplicate? If you remove the copy attached to the email, then you have broken the “family” of documents.
What if there are two different emails that share the same attachment? Are both attachments put in for review? If they are, work effort is duplicated, and reviewers can mark one copy as relevant and the other as non-relevant. Equally, if both are not brought through to review, the family of documents is broken up.
Another option is to treat them as separate files, but bring their hash values through to the review platform and allow the platform to recognize that they are duplicates. This way, if one copy is marked as relevant, the identical attachment on a different email will also be marked as relevant.
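A minimal sketch of this idea, assuming a simple in-memory review set where each document keeps its family but also carries its content hash (the data and field names are hypothetical):

```python
# Hypothetical review set: each document belongs to a family (email plus its
# attachments) but also carries its content hash so duplicates are recognisable.
documents = [
    {"id": 1, "family": "email-A", "hash": "abc123", "relevant": None},
    {"id": 2, "family": "email-B", "hash": "abc123", "relevant": None},  # same attachment
    {"id": 3, "family": "email-B", "hash": "def456", "relevant": None},
]

def mark_relevant(doc_id: int, docs: list) -> None:
    """Mark one document relevant and propagate to every duplicate by hash."""
    target = next(d for d in docs if d["id"] == doc_id)
    for d in docs:
        if d["hash"] == target["hash"]:
            d["relevant"] = True

# Marking the attachment in family A also flags its duplicate in family B,
# without either family being broken up.
mark_relevant(1, documents)
```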
But this seemingly obvious solution presents a problem: the issue of marking families. E.g. if you mark an email as confidential/privileged, does this mean the rest of the family is too? Probably. Therefore, if you mark the attachment as privileged in one location because of the email it is attached to, it would also be marked privileged where it appears as an attachment elsewhere, where it may not be privileged but relevant, and should be disclosed.
These problems may seem unlikely, and there is a low probability of any one of these complex scenarios occurring. But with document sets in the millions or tens of millions, even a 1-in-10,000 chance means there will still be hundreds or thousands of such situations in a single case, let alone across multiple cases. De-duplication: not as simple as the name implies.