What hope for Gary McKinnon?

Gary McKinnon is failing at every hurdle to fight his extradition, including losing the most recent, and possibly last, legal battle.  Some of the facts of the case appear clear.

  • Gary hacked into US military and NASA security systems
  • Gary has Asperger’s (a form of autism)
  • There is no belief in the UK legal system that he is a threat to national security, for the UK or US governments
  • If he was tried in the UK, and found guilty, he would get a couple of months in a low security prison, if anything
  • If tried in the US he could be sentenced to up to 70 years.

There are strong debates as to whether the UK should allow the extradition; often split by those on either side of the Atlantic.

In the UK the strength of public opinion, based on the legitimate concern for Gary’s future, appears to be in favor of trying Gary in the UK rather than the US. For this to happen the police just need to arrest and charge him, and then the extradition problem disappears as UK law takes precedence.

While there is a debate all we can hope for is fairness in the legal system; that is highly unlikely when politicians like Harriet Harman are involved in the political system.  Harman is a Labour MP as well as a minister at the Government Equalities Office (since 3 Dec 2007), Lord Privy Seal and Leader of the House of Commons (since 28 Jun 2007), and the deputy prime minister. Currently, while Gordon Brown is on holiday, Harriet is running the UK.

Harriet Harman voted strongly for terrorism laws, for the Iraq war, and for ID cards, but very strongly against an investigation into the Iraq war.  During the expenses scandal she was found to have hired “consultancy” services on the public purse, claimed for party political propaganda and bought expensive gadgets, which the UK taxpayer paid for.

This is relevant because on Sunday 2nd August 2009 she was interviewed on the BBC 1 Andrew Marr show, in her role as deputy prime minister. She was asked directly about the Gary McKinnon case, and if the UK government would step in to prevent the extradition. Harriet’s answer was:

“It is not the role of politicians or the parliament to second guess the courts”

Yet in March 2009 she stated that:

“The Prime Minister has said it is not acceptable therefore it will not be accepted. It might be enforceable in a court of law but it’s not acceptable in the court of public opinion, and that’s where the Government steps in.”

The latter comment was in relation to the pensions payment of Sir Fred Goodwin.

This extreme example of flip flopping is both staggering and concerning. What it shows is that the UK Government, of which Harriet Harman is a senior member, has very little concern about following guidelines, process, or even sticking to a consistent legal standard. In short they will take whatever course they see fit, and that was a clear message that Gary McKinnon will not be protected from extradition.

Based on this, Gary McKinnon has little to no chance of preventing his extradition. Right or wrong, the comments by the current acting prime minister are concerning.

Is Electronic Discovery Wrong?

The Hypothesis: Simply put electronic discovery is wrong. It is not 100% accurate, far from it. It’s inherently inaccurate. It misses out lots of data. The legal system should find other, more accurate methods.

__________________________________________________________________________________

Out of the tens of thousands of files collected from any one computer only a fraction of those are going to be taken out of the computer for full ED processing.

Out of the small percentage that do get processed, tens, hundreds or possibly thousands will fail due to errors, corruption, encryption, or other problems. Eventually, after the initial cull and the ED processing, a small set of data will be loaded into a review platform.  Once the data is reviewed even greater cuts will be made, with huge percentages of documents removed through keyword searching.

This means that out of the 10,000s of documents originally available only a few thousand or a couple of hundred, per computer, will even be given the opportunity to be reviewed by teams of lawyers.

This is where the errors really start.

The bulk of reviews are conducted not by partners or senior associates, but by junior staff, sometimes contractors brought in for this purpose.  These junior staff are working under pressure, with a requirement to review their assigned documents as quickly as possible.   The subject matter will often be new to them, the review platform may also be new to them, and they are possibly tired and working long hours.  We don’t let people drive trucks in these conditions, let alone make decisions on multi-million or multi-billion dollar cases.

Even a highly experienced litigator, working short hours on a subject they know, will make errors. It’s just a statistical fact. The junior staff, working in the conditions described, will make a lot more errors; and these errors will not be even. The type of error that a person makes on one day will not be the same on another day, and different people will make different errors. This makes the QC of the errors difficult.

Does this mean that electronic discovery is wrong? Is it fundamentally flawed, so much so that a better method should be used?

No; quite the reverse.

Electronic discovery has errors, but so do most other systems of evidence. Fingerprints are not, as many people think, 100% reliable, but require human interpretation.

Witness statements are staggeringly inaccurate. A person who has witnessed a shooting or a road accident, will be traumatized, upset, angry, and influenced by the police (even accidentally through leading questions “Did you see the suspect get into a green car?”).  The witness will have their own prejudices about what they saw, what they think they saw, and what they think happened.  Witness statements are, in short, not 100% reliable.

Even DNA, the trump card in any investigation, has errors at all levels. The science is sound but humans make mistakes.

Does this mean that DNA, fingerprints and witness statements should not be used because there can be errors? Of course not.

The issue is not that mistakes are made, but that they are understood and accounted for.

Is the electronic discovery process less accurate than DNA or fingerprints? Certainly. But does that matter?  The issue of accuracy must also be understood in terms of what is being measured.

If a forensic scientist makes an incorrect decision with their fingerprint or DNA analysis then they can state that Person A killed Person B, when in fact Person A did not kill Person B (and this has happened, on more than one occasion). This is a massive error.

If ED processing is conducted incorrectly then a junk document, Document A, may be reviewed by a lawyer when it was supposed to be left out, or a document that was supposed to be reviewed may not be.

Incorrect ED processing is very different. It does not change an accounting spreadsheet from a profit to a loss, and it does not turn perfectly legal business transactions into a multinational fraud. ED processing does not interpret the data.

An ED processing guru does not go to court and state “Due to the MD5 value of this document I have  concluded that the money was moved from the US, to Switzerland, then back to BVI, for the purposes of tax avoidance, defrauding the US government of 7.5% of the gross amount in tax, meaning that the suspect only paid 11.5% gross. The net cost to the US government is $3.5 million. With that money the suspect invested in property, making a net loss over the year, for the investment, but he still retains assets of $2.5 million…..”

No, an ED guru should go to court and say “I have provided the documents as best as I can, not 100% of documents, but the best sample I can,  and here is how….”

Everything else is up to the accountants, lawyers, and other expert witnesses. This does not demean what an ED person does, but rather allows them to put what they are doing into perspective and assess the risk.

If they are collecting hundreds of millions of documents (which is entirely possible), and thousands are not processed that is reasonable, expected, and required.

Electronic Discovery has errors, but it does not make it wrong.

Electronic Discovery: Concept Searching and Error Rates

Assessing the risks and benefits of using conceptual searching technology to cull data, compared with traditional methods of culling data.

As data sizes within companies increase, so does the number of documents available to review. As document sets are now regularly in their hundreds of thousands or even millions, reviewing all of these documents is no longer possible; therefore culling of the data has to be conducted. Traditionally this culling has been conducted via the application of keyword searching, de-duplication, and document filtering.

However, due to the sheer scale of documents in modern corporations traditional methods of culling will not always reduce the data to an affordable number of documents. Therefore vendors are increasingly offering new technologies, concept searching and near de-duping, to reduce the data volumes even further. These technologies allow review teams to radically cull documents and complete reviews in time scales that were simply not possible before, but huge volumes of documents are not being reviewed, even though they were collected.
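To make the de-duplication step concrete, here is a minimal sketch that keeps one copy of each document by hashing its content. The file names and contents are hypothetical, and SHA-256 is an arbitrary choice of hash. It only catches byte-for-byte copies; the same email held by two people usually differs slightly, which real processing tools handle in their own ways.

```python
import hashlib

def deduplicate(documents):
    """Keep the first copy of each distinct content.

    documents is a list of (name, content_bytes) pairs.
    Returns (unique_documents, number_of_duplicates_removed).
    """
    seen = set()
    unique = []
    for name, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, content))
    return unique, len(documents) - len(unique)

# Hypothetical documents: (file name, raw bytes)
documents = [
    ("alice/inbox/1.eml", b"Q3 figures attached"),
    ("bob/sent/7.eml", b"Q3 figures attached"),
    ("alice/inbox/2.eml", b"Lunch on Friday?"),
]

unique, removed = deduplicate(documents)
print(len(unique), "unique,", removed, "duplicate removed")
```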

What is the risk of the documents not being reviewed and being relevant? Are these tools proportional?

Assumptions

In this working example data is to be collected, over a period of 1 year, from 21 people. Only email and personal computers are to be collected.

Example 1

  • There are 21 people to be investigated. Each person has a laptop and an unlimited mailbox.
  • A person receives 120 emails per day 5 days a week, 20 days a month, 12 months a year.
  • There is an average of 15,000 files per PC, per person.
  • There are 18 backup tapes per year. 5 daily, 12 monthly, and 1 yearly back up.
  • Users do not delete their email, and therefore the backups contain an exact copy of the live email.
  • 30% of emails sent are between parties within the company and so are duplicates.
  • Keyword filtering reduces the data set by 66%.
  • A member of a review team bills $1,000 a day and can review 500 documents a day.
  • There are 5,000 documents and emails, in total, relating to the subject matter.

Data Volume Calculations

Media Source       Per Person    Entire Data Set
Laptop                 15,000            315,000
Email Per Year        144,000          3,024,000
Backup Tapes        2,592,000         54,432,000
Total Data Set      2,751,000         57,771,000

Based on these assumptions, there will be nearly 58 million files, within the available data set. The vast majority of this data will be duplicates. As all of the email captured on the backup tapes (in this scenario) also remains[1] on the live email account, the entire backup set can be filtered out (for calculation purposes), resulting in around 3 million files.
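The table can be reproduced with a few lines of arithmetic. This is only a sketch: the figure of 144,000 emails per person per year is taken from the table as given rather than re-derived from the per-day assumption, and each of the 18 backup tapes is assumed to hold a full copy of that year of email, which is what the table implies.

```python
PEOPLE = 21
FILES_PER_LAPTOP = 15_000
EMAILS_PER_PERSON_PER_YEAR = 144_000   # taken from the table as given
BACKUP_TAPES = 18                      # 5 daily + 12 monthly + 1 yearly

per_person = {
    "Laptop": FILES_PER_LAPTOP,
    "Email Per Year": EMAILS_PER_PERSON_PER_YEAR,
    # Assumption: every tape holds a full copy of the mailbox
    "Backup Tapes": EMAILS_PER_PERSON_PER_YEAR * BACKUP_TAPES,
}
per_person["Total Data Set"] = sum(per_person.values())

# Scale each row up to all 21 people
entire = {source: count * PEOPLE for source, count in per_person.items()}

# Files left once the backup set is filtered out as duplicate
live = entire["Laptop"] + entire["Email Per Year"]

for source in per_person:
    print(f"{source:<15}{per_person[source]:>12,}{entire[source]:>14,}")
print(f"Live data once the backup set is excluded: {live:,}")
```

On these inputs the live data comes to 3,339,000 files, which the text rounds to "around 3 million".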

Keyword searching this data, culling at 66%, would leave just over 1 million files.

Further de-duplication across the data set, e.g. the same email between multiple people, would remove 30%, resulting in just over 650,000 files.

The original data set of 58 million documents has been culled to 650,000 documents; this is a 98.9% cull.

Despite the huge cull of data, the cost of an initial review of these documents is expected to be $1,300,000; this is to locate just 5,000 documents.  The costs of hosting and culling the data can be expected to be of a similar order of magnitude.
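The cull and the review cost can be sketched in the same way. The helper names below are hypothetical; the percentages, the day rate and the review speed come from the assumptions in Example 1, and the starting figure is the unrounded live data set (laptops plus live email).

```python
def cull(files, percent_removed):
    """Number of files left after removing the given percentage."""
    return files * (100 - percent_removed) // 100

def review_cost(documents, docs_per_day=500, day_rate=1_000):
    """Cost in dollars of reviewing every document once."""
    return documents * day_rate // docs_per_day

live_files = 3_339_000                    # laptops plus live email, unrounded
after_keywords = cull(live_files, 66)     # keyword filtering removes 66%
after_dedup = cull(after_keywords, 30)    # cross-custodian de-duplication removes 30%

print(f"After keywords:       {after_keywords:,}")
print(f"After de-duplication: {after_dedup:,}")
print(f"Review cost at 650,000 documents: ${review_cost(650_000):,}")
print(f"Review cost at the unrounded figure: ${review_cost(after_dedup):,}")
```

Run on the unrounded inputs, the stages give about 1.14 million and then about 795,000 files, so the rounded 650,000 quoted above is on the low side; at $2 per document the 650,000 figure is what produces the $1,300,000 cost.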

Errors in Review – Keyword Selection

Despite the huge cost of the review it can be expected that there will be significant errors in the review. In the example above the final cull of data was 98.9%; of this, over 2.6 million files were removed either by keywords (chosen by a legal team) or during a review (by the legal team).

Keyword choice alone is problematic as highlighted by Mr. Justice Morgan in October 2008 in the DigiCel v Cable and Wireless case (HC07C01917)[2], and by U.S. Magistrate Judge Paul Grimm in Victor Stanley Inc. v. Creative Pipe Inc. 2008 (WL 2221841), who stated[3] that those involved in keyword selection were going where “Angels fear to tread”.

Judge Paul Grimm said that: “Whether search terms or ‘keywords’ will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics…. Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread. This topic is clearly beyond the ken of a layman.”

Errors in Review – Human Error

In addition to the errors introduced by the selection of keywords, errors will be made by those reviewing the data.

Measures of human error are difficult to quantify, but one known method is the Human Error Assessment and Reduction Tool, HEART, developed by Williams in 1986. This looks at the variables in a scenario and attempts to predict the error rate. Factors include inexperience, risk misperception, conflict of objectives and low morale, virtually all of which are present, to a significant degree, for a junior member of a review team working on a large scale project.

Other studies[4] of human risk perception have shown that people are likely to follow others, even if they are wrong and they know the person they are following is wrong. In addition to this, people develop a mind set to back up their errors and continue in that direction, often repeating that error.

All of this means it is likely that humans will make errors within a review.

Concept Searching – Theory

Concept searching is the application of technology that looks at the content of documents and attempts to group them together. Different concept searching technologies use different mathematical models to conduct this concept searching, including Bayesian theory, Shannon information theory and Latent Semantic Indexing. Some concept searching tools are pre-programmed with languages; others learn the language from each particular project. Despite the different approaches the tools all have the same purpose: to cull down data and increase the efficiency of the review.

Example 2: Concept searching a mail box

Two employees’ mail boxes contain emails relating to “Project Raptor”, online dating, football matches, expenses, and “Project Ocean”.  Concept Searching is applied to the data set and, given that projects Raptor and Ocean relate to very different subjects, the emails would be grouped into the 5 different concepts, known as clusters: Football, Expenses, Dating, Ocean and Raptor.

The grouping, or clustering, of the emails will be conducted regardless of keywords, i.e. an email about a football match, which does not contain the word football or match, will still be placed in the football cluster. For example, an email stating “You still available for the 5-a-side game next week?” would be placed with the other football related emails.
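As a toy illustration of how an email can land in the football cluster without containing the word "football", the sketch below assigns an email to whichever cluster shares the most vocabulary with it. This is a plain bag-of-words cosine similarity, not the Bayesian, Shannon or Latent Semantic Indexing models named above, and the seed emails and cluster names are hypothetical. A real tool would at least weight terms so that common words such as "the" count for less.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lower-case word tokens; hyphenated terms such as '5-a-side' stay whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def vector(text):
    """Bag-of-words term counts for one piece of text."""
    return Counter(tokens(text))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical emails already grouped into clusters
SEED_EMAILS = {
    "Football": [
        "Football on Thursday, 5-a-side game at the sports hall",
        "Who is in goal for the game next week?",
    ],
    "Expenses": [
        "Please submit your expenses claim with receipts",
        "Taxi receipts are missing from your claim",
    ],
    "Raptor": [
        "Project Raptor budget review meeting",
        "Updated Raptor contract draft attached",
    ],
}

# One combined term vector per cluster
CENTROIDS = {
    name: sum((vector(text) for text in texts), Counter())
    for name, texts in SEED_EMAILS.items()
}

def assign_cluster(email):
    """Return the cluster whose vocabulary is closest to the email."""
    v = vector(email)
    return max(CENTROIDS, key=lambda name: cosine(v, CENTROIDS[name]))

email = "You still available for the 5-a-side game next week?"
print(assign_cluster(email))
```

The email is placed with the football emails because it shares terms such as "5-a-side" and "game" with them, even though the keyword itself never appears.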

Concept Searching – Application

The application of concept searching allows for documents to be culled down rapidly, by focusing in on the documents known to be relevant or removing the clearly irrelevant data.

Example 3: Application of Concept Searching

A review team are trying to locate all information relating to Project Raptor. The data sizes are as described in Example 1.

The 3 million documents that are left following the removal of backup data are not keyword searched but simply de-duplicated across the entire data set, removing 30% of the data. This leaves roughly 2 million files.

The remaining data set consists of the following concepts:

Concept                        Relevant Percentage (from the 2 million files)
Project Raptor                  0.25%
Project Ocean                   0.25%
Dating                          0.5%
Football                        5%
Expenses                        5%
Junk Mail                      25%
Project Whisper                20%
Project Yellow                  1.5%
Job Applications/Interviews     5%
Marketing                      15%
Company Events                  3%
Other Classification           20%

Using these concepts it would be easy to remove/ignore most of the data, leaving only the following concepts:

Concept                 Relevant Percentage    Number of Files
Project Raptor           0.25%                           5,000
Other Classification    20%                            400,000

The 5,000 documents, which are clearly relevant, can then be reviewed in detail, at a cost of $10,000.  These files have been identified without the clumsy approach of keyword searching.

The remaining 400,000 “other classification” documents could then be reviewed in a different manner.

As the 400,000 are unlikely to be relevant to the case, they could either be ignored or a cursory review of them could be conducted. For example, these documents could be keyword searched, reducing the data set to 132,000 documents, and a random sample of 33% of these could be reviewed. The cost of reviewing these additional documents would be approximately $87,000.

This methodology of concept searching would reduce the total cost from $1.3 million to $97,000, a reduction in costs of over 90%.
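The costs in Example 3 can be checked with a short calculation. This is a sketch under stated assumptions: the keyword search is taken to retain 33% of the "other classification" documents, which is what the 132,000 figure implies, and the random sample is taken to be about a third, which is what the $87,000 figure implies. The constant names are hypothetical.

```python
DEDUPED_FILES = 2_000_000
DAY_RATE = 1_000                 # dollars per reviewer per day
DOCS_PER_DAY = 500               # documents reviewed per day
KEYWORD_RETAINED_PERCENT = 33    # assumption: implied by the 132,000 figure
SAMPLE_PERCENT = 33              # assumption: implied by the $87,000 figure
KEYWORD_ONLY_COST = 1_300_000    # cost of the keyword-led review in Example 1

def review_cost(documents):
    """Cost in dollars of reviewing every document once."""
    return documents * DAY_RATE // DOCS_PER_DAY

raptor_docs = DEDUPED_FILES * 25 // 10_000    # the 0.25% Project Raptor cluster
other_docs = DEDUPED_FILES * 20 // 100        # the 20% other classification cluster

other_after_keywords = other_docs * KEYWORD_RETAINED_PERCENT // 100
sampled = other_after_keywords * SAMPLE_PERCENT // 100

total_cost = review_cost(raptor_docs) + review_cost(sampled)
saving = 1 - total_cost / KEYWORD_ONLY_COST

print(f"Project Raptor review: ${review_cost(raptor_docs):,}")
print(f"Sampled review of other documents: ${review_cost(sampled):,}")
print(f"Total: ${total_cost:,} (saving {saving:.1%})")
```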

Concept Searching: The Panacea?

In the hypothetical Example 3 there is a perfect break down of concepts, i.e. exactly 5,000 documents were found in the cluster “Project Raptor” [i.e. the exact number of relevant documents as defined in Example 1]. This is clearly unrealistic[5] and the best that could be hoped for is a cluster of a similar magnitude, but with more files, e.g. 7,500 files. This would mean that a review team would have more documents to review than required, but far fewer than in the original data set, and they would not be missing any documents that could be relevant, i.e. the concept searching tool should err on the side of caution.

It can be seen that the reduction in cost is massive, but does concept searching really work and is it reasonable?

Concept Searching tools certainly do reduce volumes of data, but they are certainly not perfect; even a salesman for a concept searching company would not state they are. But are they a reasonable approach to large data volumes or are there better methods to cull the data down?

The Civil Procedure Rules, Part 31, define how a company must review their data with the emphasis on a “reasonable search”.  Section 31.7 of the Civil Procedure Rules states that:

(2)     The factors relevant in deciding the reasonableness of a search include the following –

(a)     the number of documents involved;
(b)     the nature and complexity of the proceedings;
(c)     the ease and expense of retrieval of any particular document; and
(d)     the significance of any document which is likely to be located during the search.

This would imply that a concept searching methodology would be allowed, and reasonable, as part of a review. This is because it can radically reduce the costs of a review, which could otherwise dwarf the value of the case. Section 31.7(2)(d) of the Civil Procedure Rules specifically refers to the significance of any document which is likely to be located. Using concept searching increases the probability that a significant document is going to be found in a given group or cluster of documents and therefore reviewing only those clusters would be reasonable. If nobody is suggesting that a receptionist’s emails are reviewed in a fraud case involving the CEO and COO, why review the documents in a “football” cluster?

The concern is not about the ability of concept searching tools to cull the documents down, but rather the accuracy of this.

Keyword searching has known flaws: the wrong keywords may be chosen, and keywords may be spelt incorrectly, either by the review team or by the custodian expected to be using the term. Keyword searching is also blunt and often produces a high volume of false positives and false negatives. But what a keyword searching tool is doing is clearly understood and the errors are known.

Concept searching, however, is often undocumented: the exact formulas are often kept hidden and there is a lack of understanding of the technology. What makes an email move from one concept to another, if it discusses two different concepts? What’s the error rate? If the same set of “relevant” email is stored in different data sets, will the concept searching tool correctly identify the same number of relevant emails?

Most importantly, is the use of concept searching reasonable?
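The two kinds of keyword error can be shown on a handful of made-up documents. In this sketch the documents, the keyword and the set marked as relevant are all hypothetical; the point is only that a misspelt term is missed (a false negative) while an unrelated use of the same word is caught (a false positive).

```python
import re

def keyword_hits(documents, keyword):
    """Names of the documents containing the keyword as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return {name for name, text in documents.items() if pattern.search(text)}

# Hypothetical documents
documents = {
    "a.eml": "Project Raptor contract is ready for signature",
    "b.eml": "Re: Project Rapter, see revised figures",   # misspelt by the custodian
    "c.eml": "Saw a raptor at the bird sanctuary on Sunday",
    "d.eml": "Canteen menu for next week",
}
relevant = {"a.eml", "b.eml"}   # what a reviewer would actually want to see

hits = keyword_hits(documents, "raptor")
false_positives = hits - relevant    # returned but not relevant
false_negatives = relevant - hits    # relevant but not returned

print("Hits:", sorted(hits))
print("False positives:", sorted(false_positives))
print("False negatives:", sorted(false_negatives))
```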

In the case of Abela v Hammond Suddards in 2008, the judge stated that “…[It is N]ot that no stone must be left unturned” but that a “reasonable search” is conducted.

This author believes that concept searching is reasonable in many, but not all, scenarios.

Footnotes


[1] This level of email storage would almost never occur; even if it did, it would be difficult to prove and therefore backup tapes should be considered.

[2] http://whereismydata.wordpress.com/2009/01/25/case-law-keywords/

[3] http://www.insidecounsel.com/News/2008/9/Pages/Where-angels-fear-to-tread.aspx

[4] Exact citation needed. Taken from Politics of Risk and Fear by Dan Gardner

[5] Given the current state of technology, though in the future this may become more realistic.

Forensics and Electronic Discovery: Proportionality of Document Review

Many people, particularly those in the criminal sector, feel that every document in a case needs to be reviewed. Certainly, there is a push for every document with a “key word hit” to be reviewed.

The response to anybody pushing for that should be simply to ask “Why?”.

If a man steals a book from a library, the police will look at the book he has, check with the library to see if it’s stolen and that’s it. Nobody would suggest that every book in the library be checked.

If a home is broken into and the scenes of crime officer attends, he fingerprints certain areas: the points of entry/exit, property that has been moved, drawers that have been opened, door handles etc. He does not fingerprint the ceiling, nor does he fingerprint the shower, or light bulbs. The scenes of crime officer does not call in a scaffolding company, at a cost of thousands of pounds, to get on the roof and fingerprint the chimney stack, just in case the burglar used a Santa Claus style technique to get into the house.

No, the attending fingerprint officer uses a reasoned approach proportional to the crime.

Despite this, why do some people insist on an unreasoned approach to reviewing documents? Why the fear of electronic documents?

Not every document needs to be read in every case. If there are 10 documents, then yes, they should all be read. But if there are 10 million? Then it would have to be a mass murder to justify that sort of review.

Methods exist to cull data, from keyword searching to near de-duplication and concept searching. A reasoned approach to all of these methods should be taken, not just a blanket yes or no to any one.

Case Law: Extreme Porn

NOTE: This article is not correct, see comments below

This week, Alan Moore became the first man to be charged under the newly established “extreme porn” laws.   This case is interesting for several reasons.

Firstly, this follows close on the heels of the failed Girls Aloud case, which saw Darryn Walker charged under the Obscene Publications Act for material involving “extreme porn”. That case was dropped by the prosecution on the day it was due to be heard. The difference with the “Girls Aloud” case was that it involved the “publication” of the material and proof that the person who published it also wrote it, but it was in writing, not in pictures.

Secondly, it’s a new law and a new case. Many people will argue that consensual sex, even violent sex, should be allowed, so recording it should also be allowed. But the famous case of R v Brown disagrees.

Thirdly, the extreme porn charges have been bundled in with sexual assault charges involving a teenage girl. It would be a brave liberal who would come to defend a paedophile, and once the first case gets through it’s easier for the second cases to get through.

The phrase “bad cases make bad laws” comes to mind.
