Forensics: Hashes, do they work?

What’s the big deal about hashes?

This article follows on from the previous article on the myths of “verification” in EnCase.

Hashes are the backbone of computer forensics. They are used to identify and remove junk data with the NSRL/NIST list, to de-duplicate files in computer forensics and electronic discovery, and to de-dupe emails in electronic discovery. But they also form the foundation of evidence security:

  • “If the hash has not been changed the data cannot be changed”
  • “The evidence cannot be tampered with because of the security of hash values”
  • “The evidence is protected by the hash values”

These are just some of the common statements about hash values made by many, if not all, forensic investigators to clients and courts alike.

But is this true?

Subjects such as hash collisions are not relevant to this article; other, far more gifted, writers have already ably demonstrated that, for the purposes of computer forensics, the MD5 and SHA-1 hashes are mathematically secure.

It is not the mathematics and cryptography that are being debated here, but rather the protocol. Can a person stand up in court and state that:

The evidence cannot have been tampered with because the hash has not been changed

The imaging process

Very briefly, this is what happens to an evidence drive during the imaging process, in both criminal and civil cases:

  • A suspect’s hard drive is connected to a computer (a hardware write blocker is normally used, but systems like Linux imaging platforms and software blockers can be used with or without hardware write blockers).
  • A hash value is calculated for the image
  • The hard drive is returned

For the criminal case the hard drive will be returned to a safe/evidence room until it is required. During the case the defence can, in theory, ask to see the original and check the hash value.

For the civil case the hard drive will normally be returned to the user and, more often than not, the computer will be booted immediately, as the custodian/user/client needs to start working straight away.

This will immediately change the state of the disk; therefore the only evidence of the original hash value is the image. The civil investigator, if challenged, can refer to the original hash value to show the data has not changed since he took the image.
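The process above can be sketched in code. A minimal illustration, assuming the source drive is exposed as an ordinary readable file (e.g. a raw device node); the function names are my own, not those of any imaging tool:

```python
import hashlib

def image_and_hash(source_path, image_path, chunk_size=1 << 20):
    """Read the source drive in chunks, write each chunk to the image
    file, and compute the MD5 of everything read. The returned digest
    is the 'acquisition hash' recorded at imaging time."""
    md5 = hashlib.md5()
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
            dst.write(chunk)
    return md5.hexdigest()

def image_unchanged(image_path, acquisition_hash, chunk_size=1 << 20):
    """Later, re-hash the image and compare. This proves the image has
    not changed *since acquisition* - nothing about what happened before."""
    md5 = hashlib.md5()
    with open(image_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest() == acquisition_hash
```

Note what the comparison can and cannot show: it detects changes made after the hash was calculated, not anything that happened to the data before that point.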

Different Data – Same Hash Value

In both of these scenarios there is one rather obvious problem: the evidence can be tampered with while the hash remains unchanged. It just depends on when the data is tampered with.

People debate the pros and cons of hash values and protecting media: SHA-1 v MD5, software write blockers v hardware write blockers, Linux v Windows, ICS v Tableau, etc., but they never debate the far more common scenario – bad people.

Note: Before going any further it should be made clear that there is no allegation that this has happened; it is merely a theory.

Let’s take those scenarios again:

  1. Person (A) takes possession of a suspect’s hard drive and connects it to their forensic computer.
  2. Person (A) then images the data
  3. The computer calculates the hash value = ABC123
  4. Person (A) provides evidence to court that the image at the time had the hash value ABC123 and has the same value now.

But, what if Person (A) was corrupt? What if Person (A) wanted to frame the suspect?

What if the sequence of events occurred as follows:

  1. Person (A) gets hold of a hard drive of a suspect and then connects it to their computer
    1. Person (A) runs a script that dumps illicit material into unallocated space.
  2. Person (A) then images the data
  3. The computer calculates the hash value = ABD123
  4. Person (A) provides evidence to court that the image at the time had the hash value ABD123 and has the same value now.

Item 1.1 would take minutes, if not seconds, to run, and would be undetectable to the naked eye. Once the data was on the computer it could be impossible to prove that it had been added deliberately.

There would be no need to change the dates; this means it’s entirely possible to insert data and simply lie about it. The hash value of the image will not change, and for a criminal case the hash value of the hard drive in storage would match the image – because the illicit material was added before the hash value was calculated.

A physical example of this would be a crooked cop planting drugs on a suspect. The drugs would then be found (following the search) and put in a sealed bag; if the bag was opened later and tested, the contents would be confirmed as drugs. The evidence of the seals would be used to prove that the drugs had not been tampered with. But this makes no difference: the drugs were planted, and the seals don’t help the person who has been set up.

Could it happen?

The scenario given has never been reported: there has never been a report of the police or a civil investigator inserting evidence onto a hard drive, nor is there reason to believe it has occurred and gone unreported. But could it happen? It is technically possible, but would people really do this?

Firstly, the police have misused data many times in the UK; secondly, people have lied on the stand, on more than one occasion, in relation to computer forensics; thirdly, police officers have been convicted of all sorts of offences, from blackmail through to rape. Why? It’s not that the police are particularly corrupt; they are simply drawn from the public, and since there are crooks in the public there will be some criminally minded individuals in the police, though far fewer than in the public as a whole. If you’re willing to commit rape or blackmail, you’re probably willing to add a few 1s and 0s to a hard drive. During the 1970s and 1980s certain parts of the police in the UK had a bad reputation: the law was bent and broken to get convictions of “the bad guys”, and inevitably innocent people were caught up in this and wrongly convicted. In fact it was such a problem that new laws and procedures were brought in to try to combat it.

But could it really happen now?

But now, in the modern world, could the police really fake evidence? Sadly yes, it could still happen. The most obvious example is the fingerprint case involving the Scottish police. In this case numerous fingerprint officers in Scotland went to court and testified that the fingerprints they had obtained from a case proved that two people were guilty, one of murder, the other of perjury: the latter was a fellow police officer. But, on appeal, fingerprint experts from around the world stated that the fingerprints clearly did not belong to the two people in jail – in short, the Scottish police were accused of framing people by lying about fingerprints.

Fingerprints are the closest physical-world analogy there is to a hash value, and here were criminal investigators lying about fingerprints. Is it impossible to believe that a person would lie about a hash value, or plant data on a person?

Put it another way, given the nature of human beings – which is more likely:

(a)    A hash collision, or

(b)   A forensic investigator doctoring evidence to get the result they require, given that the latter has happened throughout human history and the former has never happened in anything other than a maths paper.

I believe the answer is in the question!


How can forensic investigators combat this? The concern about hash value security needs to be replaced with concern about procedures and processes that show a person is being honest. Possibly the best way forward is to follow the example of the traffic police.

The traffic police have their gadgets, lots of gadgets: radar, lasers, speedometers, etc., all of which can be used to catch us (me) going slightly (a lot) faster than we should. But what do they say when they pull you over?

They say that “We observed you travelling at excess speed because [description of the car and movement], and we can provide additional evidence of this through this device [laser, radar, etc.].”

They could have pointed the radar gun at a motorbike doing 110mph and said it was you; but it is their word in court.

The primary evidence is their word, supported by technology.

Therefore, when in court, the primary evidence that the data has not changed should be the word of the person presenting the evidence, supported by hash values, and not the other way around.

It is the word of the investigator that we are all counting on, from the data collection through to the final analysis; investigators should not shy away from this.


Electronic Discovery: Is De-Duplication “Accurate”?


De-Duplication is the one thing you can rely on in electronic discovery. Everyone does it. The one-man company, the small teams, the boutiques, and the giants. It’s like the safety briefing on a plane: everybody does it, everybody knows about it.

During the pre-flight briefing you are told that, should the worst happen, the mask will pop out above your head and you should fasten it securely around your mouth before helping others. This is told to you before you get nuts and a glass of wine. It is such a well-known part of the safety briefing that if an airline did not have those ubiquitous masks you might want to get off.

But, should the worst happen and you hurtle towards the earth from 30,000 feet to a certain fiery death, that plastic mask will not be of much comfort, or use.

Would you use a company that did not use de-duplication? No, of course not. De-Duplication gets rid of “identical” data. It removes all of the duplicate data, so that you see only one copy of each file. De-Duplication is the nice relaxing part of every e-discovery pitch; everybody knows about it, everybody does it.

But, what if the worst happens and you’re hurtling towards court and there is an “alleged” problem with some of the metadata. A computer forensics expert witness is being brought in to challenge the e-discovery information presented.

Will the pre-project briefing be of much comfort?

Electronic Discovery v Computer Forensics

The e-discovery argument is this:

All files are hashed using a highly secure algorithm called “MD5”. Different files have different hash values; in fact, even if a Word document changes by just a single full stop, the entire hash value will be different. Therefore files have to be truly identical for them to have the same hash value. For this reason we can, with 100% safety, remove all of the files with identical MD5 values, as we are removing only duplicate files
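The “single full stop” claim in that pitch is easy to check. A quick sketch (the document contents here are invented):

```python
import hashlib

doc_a = b"The oil reserves are fully proven."
doc_b = b"The oil reserves are fully proven"  # same text, full stop removed

hash_a = hashlib.md5(doc_a).hexdigest()
hash_b = hashlib.md5(doc_b).hexdigest()

# One character of difference produces a completely different digest.
assert hash_a != hash_b
```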

Enter, stage left, the computer forensics man. This man tracks down internet predators, convicts terrorists, and knows all that good stuff about deleted files.

He states, with absolute confidence, “I know more about MD5s than you do, ED vendor”, then pauses for dramatic effect.

The MD5 is a function of the data, of the actual contents of the file, so the contents of the file need to be identical but the metadata does not

The CF guy then proceeds to demonstrate his point through a series of fantastic videos [soon to be made available on YouTube].

The basic issue is this: the name of a file, the location of the file, and some of the dates of the file are NOT part of the MD5 calculation.

This means that a file called “MADE UP OIL RESERVES REPORT.TXT” created in 2001 could have an identical hash value to a file called “FACTUAL OIL RESERVES REPORT.TXT” created in 2002.
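This is straightforward to demonstrate: two files with different names and different dates, but identical contents, produce identical MD5s. A sketch (file names and content are invented for illustration):

```python
import hashlib
import os
import tempfile

folder = tempfile.mkdtemp()
content = b"Reserves estimate: 10 billion barrels"

path_a = os.path.join(folder, "MADE UP OIL RESERVES REPORT.TXT")
path_b = os.path.join(folder, "FACTUAL OIL RESERVES REPORT.TXT")
for path in (path_a, path_b):
    with open(path, "wb") as f:
        f.write(content)

# Give the two files very different modification times (2001 vs 2002).
os.utime(path_a, (978307200, 978307200))    # 2001-01-01
os.utime(path_b, (1009843200, 1009843200))  # 2002-01-01

def md5_of(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Different name, path, and dates - but the same hash value.
assert md5_of(path_a) == md5_of(path_b)
```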

That’s pretty mind-blowing stuff.

If you are conducting a review into dubious claims of oil reserves at a famous oil company, and a file named “MADE UP OIL RESERVES REPORT” is removed because it is “identical” to a file called “FACTUAL OIL RESERVES REPORT”, that’s an issue, a big issue. As the dates are also different, files can be missed, or misinterpreted due to the dates.

This is an extreme example but there are lots of scenarios that can be played out in your own mind about the issue of de-duplication and how the name, file path, or date of the document could be important to a case.

Damning stuff? Should de-duplication not be done?


99.9% of the time this scenario is not going to occur and de-duplication is going to work perfectly. [I have no idea about the statistics, I just made that bit up, but it’s going to be a very high number]

Aah, but what about that 0.1%?

Previous articles have looked at the issues of low-probability errors on large data sets, and readers will know that a 0.1% error rate over a million documents will result in 1,000 files with issues falling through the net.

Even if the de-duplication issue highlighted above only has a “1 in a million chance of occurring”, in a data set of 2 million files it would be expected to happen twice. Pretty sobering stuff.

Firstly, if de-duplication was not done, there would be millions more files to review, and the review could not be achieved financially. Therefore de-duplication has to be done; it’s not an option, like sugar with tea.

Secondly, ED companies do/should/must keep track of what has been filtered out by de-duplication, or any other method. Therefore, if a critical file is found and the dates are relevant, a more detailed investigation into that file and other identical files can easily be done.

Thirdly, some ED companies approach reviews in different ways. Some put all of the files into review (duplicates and unique files) and when a file is marked as “relevant”, “hot”, “ignore”, etc then the duplicate is automatically marked, but the duplicate is still available for review if required. Other companies approach de-duplication in another way. They don’t include the duplicate files, but they provide a “place holder” instead. This allows the original duplicate files to be easily traced, again resolving the problem.
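Both approaches rest on the same underlying structure: group files by content hash, keep one representative for review, and keep an audit trail of everything suppressed so it remains traceable. A minimal sketch (the function and its return shape are my own invention, not any vendor’s API):

```python
import hashlib

def deduplicate(documents):
    """documents: iterable of (file_name, content_bytes) pairs.
    Returns (review_set, audit_trail):
      review_set  - one representative file name per unique hash
      audit_trail - hash -> every file name that shared that content,
                    so suppressed duplicates can always be traced."""
    audit_trail = {}
    for name, content in documents:
        digest = hashlib.md5(content).hexdigest()
        audit_trail.setdefault(digest, []).append(name)
    review_set = [names[0] for names in audit_trail.values()]
    return review_set, audit_trail
```

With a structure like this, marking a representative file as “relevant” can be propagated to every name in its audit-trail entry, which is exactly the traceability the placeholder approach provides.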

With the advent of concept searching and near de-duplication this scenario is going to become more common, not less.

As long as vendors keep track of documents, lawyers are aware that all that glistens is not golden, that every date is not accurate, and that every duplicate document may not be “identical”, and there is clear communication between the vendors/consultants and legal teams, this should not pose a problem, even on the very rare occasions it does occur.


De-Duplication is good and accurate, just make sure it’s tracked.

Forensic: EnCase Verification, MD5, and Other Myths

EnCase is without doubt the most popular forensics tool on the market; however, due to the name of one of its features, it has also started one of the most common myths: verification.

When EnCase completes an image it then conducts a “verification”, and when that completes it brings up a variety of hash values and confirms that the data has “verified”. Excellent. Data verified… no, not at all.

The EnCase verification does not check the original data; it checks the destination data. This is an often misunderstood point, but one that can be critical.

A very simple test of this, for the doubting Thomases out there, is simply to disconnect the original drive while the verification is being carried out. The verification will still complete successfully, despite the fact there is nothing to verify against. The reason is this:

The verification checks the image file: it verifies the integrity of the image files, an important process. It does not check whether the data imaged is correct – a very important difference.
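The difference can be made concrete. A sketch of the two checks (this illustrates the principle only; it is not EnCase’s actual implementation, and the function names are mine):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def verify_image_integrity(image_path, hash_recorded_at_imaging):
    """What a post-imaging 'verification' establishes: the image file
    still matches the hash computed when the image was written."""
    return file_md5(image_path) == hash_recorded_at_imaging

def verify_against_source(image_path, source_path):
    """The stronger check: re-read the original media and compare.
    Only this would catch data that was misread in the first place
    (and it is often impractical once the drive has been returned)."""
    return file_md5(image_path) == file_md5(source_path)
```

If the read from the source delivered junk, the image of that junk still “verifies” perfectly: the first check passes, and only the second would fail.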

Example: Company X is at a client’s site imaging hard drives. They are using Tableau write blockers, connected to laptops, and imaging to USB drives from a well-known brand (inside the USB case is a 3.5 inch 500 GB S-ATA drive). The drive to be imaged is an old 2.5 inch IDE drive.

The 2.5 inch drive, an old laptop drive, is taken out of the laptop and connected, via a 2.5 to 3.5 inch converter, to the Tableau write blocker, which is then connected, via USB, to the laptop.

The person imaging selects the source drive, the 2.5 inch, and sets the destination drive as the USB drive; this means that the data takes the following route:

1) It is read from the old, dusty, 2.5 inch hard drive.

2) It goes out of the 2.5 inch pins, into the 3.5 inch converter.

3) From the 3.5 inch converter it goes along an IDE cable.

4) From the IDE cable it goes to the Tableau write blocker.

5) Inside the Tableau, a black box converts the IDE data to USB.

6) The Tableau then transmits the data down a USB cable.

7) The USB cable connects to the laptop USB port.

8) The laptop USB port then connects to the motherboard.

9) The data is then transferred internally, and EnCase then “reads” the data.

10) EnCase then “writes the data” out and it travels along the motherboard to another USB port.

11) From the USB port it goes down a USB cable to the USB drive.

12) The USB drive then converts the data for the 3.5 inch S-ATA drive.

13) The 3.5 inch S-ATA drive then writes the data.

It is not until step (9) that EnCase reads the data. It is that data that EnCase then writes, and then verifies what it has written. If it is feasible for an error to occur between steps 9 and 13, hence the need for the verification, it is at least as feasible that an error occurs between steps 1 and 9.

If the hard drive is not working correctly, or the cables are damaged, or the pins are not aligned correctly, or for any of a host of other reasons, then the hard drive will not image correctly. 99% of the time this will be a very obvious error, e.g. the hard drive will not spin up, or it cannot be seen – which is a good error to have, as it can be addressed.

Sometimes, very rarely but sometimes, the drive will image but will be producing junk data, or “skewed” data. While this is rare, it certainly does happen (unlike the theoretical problem of MD5 collisions); i.e. this is a real-world problem, not one confined to labs and mathematics papers.

In the worst-case scenario this means that data will be imaged, EnCase will read it, write it, and then verify it. The person conducting the image will then leave the scene and state, without intending to lie, that they have a 100% accurate image of the data. When in actual fact they have junk. This can, and does, lead to all sorts of problems.

In one case the image of a single hard drive was taken at a “suspect’s” home; the image was verified and then taken back to the office. The image was later investigated, and from the investigation the examiner concluded that the user had wiped their drive with a tool that deliberately made a mess of the MFT.

What had actually happened is that the image of the drive was poor, and much of the MFT was skewed during the imaging process, probably due to bad electronics/electrics somewhere in the imaging chain; i.e. they had not taken a good image. But the person investigating the drive did not know/understand this, and as a result produced a very detailed report explaining how the drive had been deliberately wiped to hide information.

The suspect/victim of this allegation was fortunate in that the computer was working (and shown to be working) prior to the image being taken and was working after the image was taken; this was, oddly, recorded by the person conducting the image. From this alone it was very obvious that the one and only drive in the computer could not have been wiped. But, in this case, a long and detailed report accusing the suspect/victim of wiping evidence was submitted. While there was no evidence of the original allegations, the report stipulated, at great length, that the suspect had wiped their drive, and that conclusions could therefore be drawn from that. The person writing the report was adamant that the image was correct, because it verified when he wrote the report. Even though he was hundreds of miles from the actual hard drive, the myth of EnCase verification was so strong that he believed the verification guaranteed the quality of the data. A common belief.

A second image was taken, correctly, and the drive examined. From this it could be seen that there was no evidence of wiping, nor evidence of the original allegations. The suspect/victim’s statements that their computer was working were fully corroborated, and they were proved innocent.