UK Supreme Court to open on Thursday – 1st October 2009

The UK Supreme Court will open on 1st October 2009:

The Out-Law article on the subject is below:

The UK’s legal landscape will change tomorrow when the Supreme Court takes over from the House of Lords as the country’s highest court.

The Government created the Supreme Court as part of a move to ensure that the highest court in the country is completely independent from those who make the law. The Court will occupy its own new building and be made up of 12 judges.

Until now the most senior court was made up of the Appellate Committee of the House of Lords and the Judicial Committee of the Privy Council. The House of Lords is part of Parliament and the Privy Council is connected to the Government, but the Supreme Court will not be connected to either.

The Court said that it would now be “explicitly separate from both Government and Parliament”.

Eleven of the 12 judges that form the Appellate Committee of the House of Lords will become judges at the Supreme Court. One, Lord Neuberger, is moving to the Court of Appeal, where he will be head of its civil division.

Lord Neuberger has expressed reservations about the new structure and has said that the very independence from Government that is its founding principle might lead it to assume too much power.

“The danger is that you muck around with a constitution like the British constitution at your peril because you do not know what the consequences of any change will be,” he told BBC Radio 4 earlier this month. “[There is a risk of] judges arrogating to themselves greater power than they have at the moment.”


Electronic Discovery: Concept Searching and Error Rates

Assessing the risks and benefits of using concept searching technology to cull data, compared with traditional culling methods.

As data sizes within companies increase, so does the number of documents available to review. Document sets are now regularly in the hundreds of thousands, or even millions, so reviewing every document is no longer possible and the data has to be culled. Traditionally this culling has been carried out through keyword searching, de-duplication, and document filtering.

However, given the sheer scale of documents in modern corporations, traditional culling methods will not always reduce the data to an affordable number of documents. Vendors are therefore increasingly offering newer technologies, such as concept searching and near de-duplication, to reduce data volumes even further. These technologies allow review teams to cull documents radically and complete reviews in time scales that were simply not possible before, but the result is that huge volumes of documents are never reviewed, even though they were collected.

What is the risk that documents which are not reviewed turn out to be relevant? Is the use of these tools proportionate?

Assumptions

In this working example, data is to be collected from 21 people over a period of one year. Only email and personal computers are to be collected.

Example 1

  • There are 21 people to be investigated. Each person has a laptop and an unlimited mailbox.
  • Each person’s mailbox grows by 144,000 emails per year (an average of 600 emails per working day, 20 working days a month, 12 months a year).
  • There is an average of 15,000 files per PC, per person.
  • There are 18 backup tapes per year: 5 daily, 12 monthly, and 1 yearly backup.
  • Users do not delete their email, and therefore the backups contain an exact copy of the live email.
  • 30% of emails sent are between parties within the company and so are duplicates.
  • Keyword filtering reduces the data set by 66%.
  • A member of a review team bills $1,000 a day and can review 500 documents a day.
  • There are 5,000 documents and emails, in total, relating to the subject matter.

Data Volume Calculations

Media Source        Per Person    Entire Data Set
Laptop                  15,000            315,000
Email Per Year         144,000          3,024,000
Backup Tapes         2,592,000         54,432,000
Total Data Set       2,751,000         57,771,000

Based on these assumptions, there will be nearly 58 million files within the available data set. The vast majority of this data will be duplicates. As all of the email captured on the backup tapes (in this scenario) also remains[1] in the live email accounts, the entire backup set can be filtered out (for calculation purposes), leaving around 3 million files.

Keyword searching this data, culling at 66%, would leave just over 1 million files.

Further de-duplication across the data set, e.g. the same email between multiple people, would remove 30%, resulting in just over 650,000 files.

The original data set of 58 million documents has been culled to 650,000 documents; this is a 98.9% cull.

Despite the huge cull, the cost of an initial review of these documents is expected to be $1,300,000, and this is to locate just 5,000 relevant documents. The costs of hosting and culling the data can be expected to be of a similar order of magnitude.
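For readers who want to check the arithmetic, below is a minimal Python sketch of the Example 1 figures. Every constant comes from the assumptions listed above, and the 650,000-document review set is the rounded figure used in this article.

```python
# A minimal sketch of the Example 1 arithmetic. Every constant is an
# assumption stated above; the 650,000-document review set is the article's
# rounded figure after keyword culling and de-duplication.

PEOPLE = 21
FILES_PER_LAPTOP = 15_000
EMAILS_PER_YEAR = 144_000      # per person, live mailbox
BACKUP_TAPES = 18              # each assumed to hold a full copy of the mailbox

laptop_total = PEOPLE * FILES_PER_LAPTOP                 # 315,000
email_total = PEOPLE * EMAILS_PER_YEAR                   # 3,024,000
backup_total = email_total * BACKUP_TAPES                # 54,432,000
full_set = laptop_total + email_total + backup_total     # 57,771,000

# Backups duplicate the live mail, so they are filtered out first.
working_set = laptop_total + email_total                 # roughly 3.3 million
after_keywords = working_set * (1 - 0.66)                # just over 1 million

review_set = 650_000           # after further de-duplication (article's figure)
REVIEW_RATE = 500              # documents per reviewer-day
DAY_RATE = 1_000               # dollars per reviewer-day
review_cost = review_set / REVIEW_RATE * DAY_RATE        # 1,300 days -> $1,300,000

print(f"Full data set:   {full_set:,}")
print(f"Working set:     {working_set:,}")
print(f"After keywords:  {after_keywords:,.0f}")
print(f"Review set:      {review_set:,}")
print(f"Review cost:     ${review_cost:,.0f}")
```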

Errors in Review – Keyword Selection

Despite the huge cost of the review, significant errors can be expected. In the example above the working data set was ultimately culled down to the 5,000 relevant documents, a cull of over 99.8%, with over 2.6 million files removed either by keywords (chosen by the legal team) or during the review itself (by the legal team).

Keyword choice alone is problematic as highlighted by Mr. Justice Morgan in October 2008 in the DigiCel v Cable and Wireless case (HC07C01917)[2], and by U.S. Magistrate Judge Paul Grimm in Victor Stanley Inc. v. Creative Pipe Inc. 2008 (WL 2221841), who stated[3] that those involved in keyword selection were going where “Angels fear to tread”.

Judge Paul Grimm said that: “Whether search terms or ‘keywords’ will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics…. Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread. This topic is clearly beyond the ken of a layman.”

Errors in Review – Human Error

In addition to the errors introduced by keyword selection, errors will be made by those reviewing the data.

Human error is difficult to quantify, but one established method is the Human Error Assessment and Reduction Technique (HEART), developed by Williams in 1986. This looks at the variables in a scenario and attempts to predict the error rate. Factors include inexperience, risk misperception, conflict of objectives and low morale, virtually all of which are present, to a significant degree, for a junior member of a review team working on a large-scale project.

Other studies[4] of human risk perception have shown that people are likely to follow others, even when they know the person they are following is wrong. In addition, people develop a mindset that backs up their errors and continue in the same direction, often repeating the error.

All of this means it is likely that human reviewers will make errors during a review.

Concept Searching – Theory

Concept searching is the application of technology that looks at the content of documents and attempts to group similar documents together. Different concept searching tools use different mathematical models to do this, including Bayesian theory, Shannon information theory and latent semantic indexing. Some tools are pre-programmed with a language model; others learn the language from each particular project. Despite the different approaches, the tools all have the same purpose: to cull down data and increase the efficiency of the review.
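As an illustration of the general idea only (not of any vendor’s actual algorithm), the sketch below clusters a handful of short emails by content using TF-IDF, a small latent semantic reduction and k-means. It assumes scikit-learn is available, and the example emails are invented.

```python
# A minimal sketch of content-based clustering, assuming scikit-learn is
# installed. Commercial concept-searching tools use proprietary models; this
# simply shows documents being grouped by content rather than by keyword.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

emails = [
    "You still available for the 5-a-side game next week?",
    "Attached are the latest drawings for Project Raptor.",
    "Please approve my travel expenses for March.",
    "Raptor design review moved to Thursday.",
    "Match kicks off at 7pm, bring your boots.",
    "Expense claim rejected - missing receipts.",
]

# Vectorise the text, reduce it to a small "concept" space, then cluster.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=3, random_state=0),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(emails)

for label, text in sorted(zip(labels, emails)):
    print(label, text)
```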

Example 2: Concept searching a mail box

Two employees’ mailboxes contain emails relating to “Project Raptor”, online dating, football matches, expenses, and “Project Ocean”. Concept searching is applied to the data set and, given that Projects Raptor and Ocean relate to very different subjects, the emails would be grouped into five different concepts, known as clusters: Football, Expenses, Dating, Ocean and Raptor.

The grouping, or clustering, of the emails is conducted regardless of keywords: an email about a football match that does not contain the word “football” or “match” will still be placed in the football cluster. For example, an email stating “You still available for the 5-a-side game next week?” would be placed with the other football-related emails.

Concept Searching – Application

The application of concept searching allows documents to be culled rapidly, by focusing on the clusters known to be relevant or removing the clearly irrelevant data.

Example 3: Application of Concept Searching

A review team are trying to locate all information relating to Project Raptor. The data sizes are as described in Example 1.

The 3 million documents left following the removal of backup data are not keyword searched but simply de-duplicated across the entire data set, removing 30% of the data. This leaves around 2 million files.

The remaining data set consists of the following concepts:

Concept                         Relevant Percentage (of the 2 million files)
Project Raptor                   0.25%
Project Ocean                    0.25%
Dating                           0.5%
Football                         5%
Expenses                         5%
Junk Mail                       25%
Project Whisper                 20%
Project Yellow                   1.5%
Job Applications/Interviews      5%
Marketing                       15%
Company Events                   3%
Other Classification            20%

Using these concepts it would be easy to remove/ignore most of the data, leaving only the following concepts:

Concept                 Relevant Percentage    Number of Files
Project Raptor           0.25%                   5,000
Other Classification    20%                    400,000

The 5,000 documents in the Project Raptor cluster, which are clearly relevant, can then be reviewed in detail at a cost of $10,000. These files have been identified without the clumsy approach of keyword searching.

The remaining 400,000 “other classification” documents could then be reviewed in a different manner.

As the 400,000 “other classification” documents are unlikely to be relevant to the case, they could either be ignored or given a cursory review. For example, these documents could be keyword searched, reducing the set to around 132,000 documents, and a random sample of 30% of these could be reviewed. The cost of reviewing these additional documents would be approximately $87,000.

This concept searching methodology would reduce the total review cost from $1.3 million to around $97,000, a reduction of over 90%.
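The same comparison can be expressed as a short calculation, reusing the Example 1 review rates. The document counts and percentages are this article’s illustrative figures, and the sampled-review step is an approximation of the roughly $87,000 quoted above.

```python
# The cost comparison between the keyword-based review of Example 1 and the
# concept-searching workflow of Example 3, using the article's illustrative
# figures and review rates.
REVIEW_RATE = 500      # documents per reviewer-day
DAY_RATE = 1_000       # dollars per reviewer-day

def review_cost(n_docs):
    """Cost of a linear review at the assumed rate."""
    return n_docs / REVIEW_RATE * DAY_RATE

# Example 1: keyword culling and de-duplication leave ~650,000 documents.
keyword_approach = review_cost(650_000)                  # $1,300,000

# Example 3: review the Raptor cluster in full, then keyword-search the
# "other classification" cluster and review a 30% random sample of it.
raptor_cluster = review_cost(5_000)                      # $10,000
other_sample = review_cost(132_000 * 0.30)               # ~$80,000 (the article uses ~$87,000)
concept_approach = raptor_cluster + other_sample

print(f"Keyword review:  ${keyword_approach:,.0f}")
print(f"Concept review:  ${concept_approach:,.0f}")
print(f"Reduction:       {1 - concept_approach / keyword_approach:.0%}")
```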

Concept Searching: The Panacea?

In the hypothetical Example 3 there is a perfect breakdown of concepts: exactly 5,000 documents were found in the “Project Raptor” cluster, the exact number of relevant documents defined in Example 1. This is clearly unrealistic[5]; the best that could be hoped for is a cluster of a similar magnitude but with more files, e.g. 7,500. A review team would then have more documents to review than strictly required, but far fewer than in the original data set, and would not be missing documents that could be relevant. In other words, the concept searching tool should err on the side of caution.

It can be seen that the reduction in cost is massive, but does concept searching really work and is it reasonable?

Concept searching tools certainly do reduce data volumes, but they are not perfect; even a salesman for a concept searching company would not claim they are. Are they a reasonable approach to large data volumes, or are there better methods of culling the data?

The Civil Procedure Rules, Part 31, define how a party must search its documents, with the emphasis on a “reasonable search”. Rule 31.7 states that:

(2)     The factors relevant in deciding the reasonableness of a search include the following –

(a)     the number of documents involved;
(b)     the nature and complexity of the proceedings;
(c)     the ease and expense of retrieval of any particular document; and
(d)     the significance of any document which is likely to be located during the search.

This would imply that a concept searching methodology would be allowed, and reasonable, as part of a review, because it can radically reduce the costs of a review that could otherwise dwarf the value of the case. Rule 31.7(2)(d) specifically refers to the significance of any document likely to be located. Concept searching increases the probability that a significant document will be found in a given group or cluster of documents, and therefore reviewing only those clusters would be reasonable. If nobody suggests reviewing a receptionist’s emails in a fraud case involving the CEO and COO, why review the documents in a “football” cluster?

The concern is not about the ability of concept searching tools to cull documents, but rather about the accuracy with which they do so.

Keyword searching has known flaws: the wrong keywords may be chosen, or keywords may be spelt incorrectly, either by the review team or by the custodian expected to be using the term. Keyword searching is also blunt, often producing a high volume of false positives and false negatives. But what a keyword searching tool is doing is clearly understood, and the errors are known.

Concept searching, however, is largely undocumented: the exact formulas are often kept hidden and the technology is poorly understood. What makes an email move from one concept to another if it discusses two different subjects? What is the error rate? If the same set of “relevant” emails is stored in different data sets, will the concept searching tool identify the same number of relevant emails each time?
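One partial answer to the error-rate question is to sample the material a concept-search cull has discarded and estimate how many relevant documents were left behind. A rough sketch of that check follows; the discarded-set size, sample size and hit count are all hypothetical.

```python
# A rough sketch of validating a concept-search cull by sampling the
# discarded documents. All of the numbers here are hypothetical.
import math

discarded = 1_400_000        # documents removed by the cull
sample_size = 2_000          # randomly drawn from the discarded set
relevant_in_sample = 3       # flagged as relevant by a manual review of the sample

rate = relevant_in_sample / sample_size
estimated_missed = rate * discarded

# Crude 95% confidence band on the rate (normal approximation).
margin = 1.96 * math.sqrt(rate * (1 - rate) / sample_size)
low = max(0.0, rate - margin) * discarded
high = (rate + margin) * discarded

print(f"Estimated relevant documents discarded: {estimated_missed:,.0f} "
      f"(roughly {low:,.0f} to {high:,.0f})")
```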

Most importantly, is the use of concept searching reasonable?

In the case of Abela v Hammond Suddards in 2008, the judge stated that the requirement is not that “no stone must be left unturned”, but that a “reasonable search” is conducted.

This author believes that concept searching is reasonable in many, but not all, scenarios.

Footnotes


[1] This level of email retention would almost never occur; even if it did, it would be difficult to prove, and therefore backup tapes would normally still have to be considered.

[2] https://whereismydata.wordpress.com/2009/01/25/case-law-keywords/

[3] http://www.insidecounsel.com/News/2008/9/Pages/Where-angels-fear-to-tread.aspx

[4] Exact citation needed. Taken from Risk: The Science and Politics of Fear by Dan Gardner.

[5] Given the current state of technology, though in the future this may become more realistic.

Electronic Discovery: Lord Jackson Report

Update: Following a recent debate about the nature of this article, the following points should be highlighted.

At no point is any blame attributed to any individual. The aim of this article was to raise a small technical issue, in a small chapter, of a very large document.

The primary issue is the exchange of PST files versus the exchange of a load file. This is a spectacularly insignificant issue when dealing with litigation worth millions or billions, and in terms of Jackson’s report as a whole. Unfortunately the author of this article happens to be one of those who has to deal with that exchange of documents, and so it becomes very important to those who conduct this type of work. The article covers why, from a technical perspective, one suggestion mentioned by Jackson is probably not effective.

It is not suggested that Lord Justice Jackson would ever be expected to know about this subject. It is not even something a lawyer dealing with a case would really know about; it is far too low level.

The other issue is the expectation of costs. While lawyers, like vendors, can predict their costs with the information available, predicting those costs early on, without accurate information, is very difficult and can be misleading. It is not unusual for vendors and consultants to be told that there will be X amount of data for Y custodians, only for X or Y to increase two-, three- or even ten-fold. This can result in a two-, three- or ten-fold increase in costs. Until the vendors and consultants get their hands on the data, have conducted their data mapping exercise and have started processing, they often do not have a good grip on the costs.

Vendors are aware of this and are reluctant to “over promise and under deliver”.

Original Article

The report by Lord Justice Jackson into litigation costs in the UK, published last month, is seen as a major influence on electronic discovery in the UK, with Chapter 40 dedicated to electronic disclosure.

It may well be as influential as commentators suggest, but that does not mean it is going to influence the industry in the right way.

In paragraphs 5.10 and 5.11 of Chapter 40 (Volume 2) of the report, Lord Justice Jackson states:

5.10 Format of disclosure. Parties sometimes make disclosure in a format which is unhelpful to the other side and which requires duplication of work already carried out by the disclosing party. This causes duplication of costs. It has been suggested that this could be avoided by the parties being required to make disclosure in a suitable format. One example is for disclosure to be of PST files in native format rather than disclosure of the image of the document (TIFFS or individual message files). Much valuable information in the PST file is lost in the process of conversion to TIFFs or message files.

5.11 It has been emphasised by practitioners that it is vital to preserve the context of a document, for example by preserving the folder structure of the data, or the file hierarchy (and making disclosure in native format), rather than disclosing documents separately and not within their original folders. The paper equivalent would be to disclose files of papers which are labelled by subject or by person (showing their place in the relevant transactions), rather than to disclose an enormous box of papers in unlabelled files with, for example, meeting minutes scattered throughout. One experienced practitioner has suggested that parties should be required to produce disclosure in a manner that is cost effective for both parties, in the event that a specific format cannot be agreed upon.

This strongly implies that Lord Jackson has been given some very bad advice, as anyone with even a cursory knowledge of electronic discovery procedures would see the glaring problems with one statement in particular:

One example is for disclosure to be of PST files in native format rather than disclosure of the image of the document

Some of the reasons this statement is so far removed from the reality of electronic discovery processing are as follows:

1)      The first thing that happens to a PST file, for processing, is that it is broken into MSG files so that the files within it can be processed. Therefore, for PST files to be exchanged, the messages would have to be reassembled into a PST only to be broken down again by the receiving party. This is the very definition of duplication of effort.

2)      The whole of a PST file is unlikely to be disclosed; certain messages or attachments will not be passed to the other side, making the exchange of an entire PST impossible.

3)      Native files are often not exchanged for reasons of confidentiality; TIFF files can be redacted and native files cannot.

4)      Review platforms do not review PST files directly, so why exchange data in a format that is not going to be reviewed?

5)      The issue of folder paths within PST files is an important one, but the folder path of each message can be preserved during processing and therefore that information can be exchanged; it just needs to be requested. If it cannot be displayed in review, it cannot be usefully exchanged, in which case providing the whole PST file is not going to be of any more assistance than exchanging the standard load file.

Finally, and most importantly, it’s simply a really bad idea.

Two companies that exchange documents correctly, following an agreed standard, e.g. a Concordance load file with all of the fields correctly aligned, can expect to have the data up and running, ready for review, on the same day as the exchange. However, a company given tens of thousands of emails and documents in PST files will spend a long time loading the PST files into its processing engine, breaking them into MSG files and attachments, and dealing with errors, corruption and encryption issues, as well as long file paths or other problems that arise.

Eventually, after all of the processing and error correction, the electronic discovery company will end up with native files, text files and a load file: data which can then be placed into a review platform. This data, which the eDiscovery company has produced after all that work and cost, is exactly the same product that is already available for exchange under the standard method. And this processing of a PST file would be done after the other side had conducted the same process and then reversed it to create a PST file.
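To make the “standard method” concrete: the end product is essentially extracted text and natives accompanied by a delimited load file. Below is a minimal sketch of writing a Concordance-style DAT file from already-extracted message metadata. The field names, paths and the sample record are invented, and the delimiter characters are common conventions rather than a mandated standard.

```python
# A minimal sketch of the load-file step: writing a Concordance-style DAT
# file from message metadata that has already been extracted from a PST.
# Field names, paths and the sample record are purely illustrative.

DELIM = chr(20)    # field separator commonly used in Concordance load files
QUOTE = chr(254)   # text qualifier (the thorn character)

FIELDS = ["DOCID", "CUSTODIAN", "FROM", "TO", "DATE", "SUBJECT",
          "FOLDERPATH", "NATIVEPATH", "TEXTPATH"]

records = [{
    "DOCID": "ABC000001",
    "CUSTODIAN": "Smith, J",
    "FROM": "j.smith@example.com",
    "TO": "a.jones@example.com",
    "DATE": "2009-03-12",
    "SUBJECT": "Project update",
    "FOLDERPATH": "Mailbox - Smith/Inbox/Projects",
    "NATIVEPATH": r"NATIVE\ABC000001.msg",
    "TEXTPATH": r"TEXT\ABC000001.txt",
}]

def dat_line(values):
    """Wrap each value in the text qualifier and join with the delimiter."""
    return DELIM.join(QUOTE + value + QUOTE for value in values)

with open("exchange.dat", "w", encoding="cp1252", newline="") as out:
    out.write(dat_line(FIELDS) + "\n")
    for record in records:
        out.write(dat_line(record[field] for field in FIELDS) + "\n")
```

A file like this, delivered alongside the native and text files, can be loaded straight into a review platform, which is why a correctly agreed exchange can be up and running the same day.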

The process being hinted at by Lord Justice Jackson would look something like this:

1)      Party A – Break PST files into MSG Files

2)      Party A – Deal with errors/corruption/encryption

3)      Party A – Extract Text

4)      Party A – Create Load File

5)      Party A – Create PST files

6)      Party A – Exchange PST with Party B

7)      Party B – Break PST files into MSG Files

8)      Party B – Deal with errors/corruption/encryption

9)      Party B – Extract Text

10)   Party B – Create Load File

11)   Party B – Load data into review

Rather than the current system, which is:

1)      Party A – Break PST files into MSG Files

2)      Party A – Deal with errors/corruption/encryption

3)      Party A – Extract Text

4)      Party A – Create Load File

5)      Party A – Exchange Load File with Party B

6)      Party B – Load data into review

From this it can be seen that the method hinted at by Lord Jackson quite literally doubles the amount of work. For a report discussing the increasing costs of litigation and how they can be addressed, this is surely not what is intended.

Justice Jackson cannot be expected to know this information, and therefore his advisors have to be blamed.

In paragraph 5.13, Lord Justice Jackson also states:

5.13 Practitioners and judges familiar with e-disclosure also recommend that parties obtain estimates of the potential e-disclosure costs at an early stage. Such estimates can be discussed between the parties and produced to the court if necessary. Indeed, the production of such estimates would be an essential step if the court is going to undertake “costs management” of business litigation, as discussed in chapter 48 below.

This is one of the most common requests made by solicitors of electronic discovery vendors. How much will it cost? How long will it take? Vendors are always seen to be dragging their heels on this issue.

This is not because they are unwilling to quote, but because they lack information.

The cost of electronic discovery processing can be described as a function of data size (GB), number of files and number of errors.

The more data, the more files and the more errors there are, the more it costs. Unfortunately, none of this information is available until the project has been all but completed.

Data sizes: The data size estimates provided by clients are often wildly inaccurate. Statements such as “there will be no more than 500 MB of data per custodian, and there will be 10 custodians” often result in 20 custodians with 2 GB of data each being delivered. It is these huge changes in data volume that make it hard for vendors to quote.

File count: The more files there are, the more work there is to do. 2 GB of data could be a single file, or 2 million files, or anywhere in between. It is not until processing has started that the file count is known.

Errors: The more errors there are, and the more complex they are, the longer the project will take and the more it will cost.

In addition to these costs there is the cost of review, which cannot be known until the data has been loaded into a review platform. 10 GB of data going into processing could be 100,000 files culled down to just 1,000 files, which are quick to review; or it could be 50,000 files culled down to 40,000.

There are a lot of variables, making it very difficult to give an honest assessment of cost early on. That said, vendors could and should do a better job of estimating costs and data sizes, using sampling, concept searching and early case assessment.
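To illustrate why those early estimates move so much, here is a toy cost model. Every rate in it is hypothetical; it simply shows how a change in custodian count, data per custodian or error rate scales the bill.

```python
# A toy model of processing cost as a function of data size, file count and
# error rate. Every rate here is hypothetical; real vendor pricing varies.

def processing_cost(gb, files, error_rate,
                    per_gb=150.0, per_thousand_files=5.0,
                    hours_per_error=0.02, per_hour=120.0):
    base = gb * per_gb + files / 1_000 * per_thousand_files
    error_handling = files * error_rate * hours_per_error * per_hour
    return base + error_handling

# The quoted scenario: 10 custodians at 500 MB each, modest error rate.
quoted = processing_cost(gb=10 * 0.5, files=50_000, error_rate=0.01)

# What actually turns up: 20 custodians at 2 GB each, messier data.
actual = processing_cost(gb=20 * 2, files=400_000, error_rate=0.03)

print(f"Quoted estimate:  ${quoted:,.0f}")
print(f"Actual outcome:   ${actual:,.0f}  ({actual / quoted:.1f}x the quote)")
```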

Conclusion: From reading Chapter 40, it appears either that bad advice has been given to Lord Justice Jackson or that those with the technical knowledge have failed to get their message across.

Girls (Scream) Aloud: A Test Case?

Is the Girls (Scream) Aloud case really a test case? Does it affect any laws?

The horror/porn/rape story involving Girls Aloud generated thousands of articles and huge media interest, as it was set to be the biggest test of the Obscene Publications Act since the Lady Chatterley case.

The rules of the game were simple:

Darryn Walker was arrested for writing and publishing a story that, by any “normal” moral standard, is awful, and he was to be tried in court on the definition of obscenity. The outcome of the game would be one of two options:

1) If Mr. Walker was found guilty, this would be a landmark decision on what cannot be published on the internet, a “triumph for moral decency”.

2) If Mr. Walker was found not guilty, it would be a “triumph for freedom of speech”. This would effectively mean that the Obscene Publications Act was dead.

People on both sides of the divide were vocal in their certainty that they were right.

In a country where Nuts and Zoo make soft porn commonplace, the idea that Lady Chatterley’s Lover was banned for so long seems astounding. But even in times when, according to a BBC survey, the most popular career choice for young British female teens is “glamour model”, the Girls (Scream) Aloud story is truly offensive by most standards.

But does that mean it should be banned?

If it had just been written in a book, the case and its ramifications would have been more clearly defined. But because it was published on the internet, the case was far more complex. To make matters more interesting, the servers on which the story was originally published were outside the UK.

Was the UK going to legislate about data held outside the UK? Parliament can do that, but would it?

The Obscene Publications Act is currently aimed at UK material; would the courts want, in effect, to change the legislation? Because of these questions the case had potentially massive consequences. The UK Government could have used its ability to censor material on the internet with technology, rather than making a case of it; the Government regularly uses technology to filter the internet, particularly for controversial subjects. Or it could have made a deal for the material to be removed.

Despite other options being available, the decision was made to press charges.

The result of this huge trial? Not Guilty.

So a triumph for freedom of speech? Well not quite.

On the day of the judgment, 29th June 2009, Mr Walker was found not guilty, but this was because the Crown Prosecution Service effectively walked away from the case by offering no evidence.

This was largely because the prosecution had stated that the material could be found with simple searches, possibly by those genuinely looking for information on Girls Aloud. However, the defence provided evidence showing that the story could only be found with specific searches. With this the CPS withdrew, and the judge entered a formal verdict of not guilty.

But as no judgment was given on the substance of the case, the facts were never discussed or debated. As a result this is not really a test case, though it is probably a demonstration of the CPS’s resolve, or lack of it, to prosecute in these circumstances.

Civil Law: Mareva injunction

The Mareva injunction (also known as a freezing order, Mareva order or Mareva regime) is, in Commonwealth jurisdictions, a court order which freezes assets so that a defendant to an action cannot dissipate their assets beyond the jurisdiction of the court so as to frustrate a judgment. It is named after Mareva Compania Naviera SA v International Bulkcarriers SA [1975] 2 Lloyd’s Rep 509, decided in 1975, although the first recorded instance of such an order in English jurisprudence was Nippon Yusen Kaisha v Karageorgis in 1975, decided very shortly before the Mareva decision; in the UK the Civil Procedure Rules 1998 now define a Mareva order as a “freezing” order. It is widely recognised in other common law jurisdictions, and such orders can be made to have worldwide effect. It is variously construed as part of a court’s inherent jurisdiction to restrain breaches of its process.

It is not a security (Jackson v Sterling Industries Ltd), nor a means to pressure a judgment debtor (Camdex International Ltd v Bank of Zambia (No. 2)), nor does it confer a proprietary interest in the assets of the judgment debtor (Cretanor Maritime Co Ltd v Irish Marine Management Ltd). However, some authorities have treated the Mareva injunction as an order to stop a judgment debtor from dissipating his assets so as to have the effect of frustrating judgment, rather than the more strenuous test of requiring an intent to abuse court procedure. An example of the former would be paying off a legitimate debt (Iraqi Ministry of Defence v Arcepey Shipping Co SA), whereas an example of the latter would be hiding the assets in overseas banks on receiving notice of the action.

Source: Wikipedia