Electronic Discovery: Concept Searching and Error Rates

Assessing the risks and benefits of using conceptual searching technology to cull data, compared with traditional methods of culling data.

As data sizes within companies increase, so do the number of documents available to review. As document sets are now regularly in their hundreds of thousands or even millions reviewing all of these documents is no longer possible; therefore culling of the data has to be conducted. Traditionally this culling has been conducted via the application of keyword searching, de-duplication, and document filtering.

However, due to the sheer scale of documents in modern corporations traditional methods of culling will not always reduce the data to an affordable number of documents. Therefore vendors are increasingly offering new technologies, concept searching and near de-duping, to reduce the data volumes even further. These technologies allow review teams to radically cull documents and complete reviews in time scales that were simply not possible before, but huge volumes of documents are not being reviewed, even though they were collected

What is the risk of the documents not being reviewed and being relevant? Are these tools proportional?


In this working  example data is to be collected, over a period of 1 year, from 21 people. Only email and personal computers are to be collected.

Example 1

  • There are 21 people to be investigated. Each person has a laptop and an unlimited mailbox.
  • A person receives 120 emails per day 5 days a week, 20 days a month, 12 months a year.
  • There is an average of 15,000 files per PC, per person.
  • There are 18 backup tapes per year. 5 daily, 12 monthly, and 1 yearly back up.
  • Users do not delete their email, and therefore the backups contain an exact copy of the live email.
  • 30% of emails sent are between parties within the company and so are duplicates.
  • Keyword filtering reduces the data set by 66%.
  • An member of a review team bills $1000 a day and can review 500 documents a day.
  • There are 5,000 documents and emails, in total, relating to the subject matter.

Data Volume Calculations

Media Source Per Person Entire Data Set
Laptop 15,000 315,000
Email Per Year 144,000 3,024,000
Backup Tapes 2,592,000 54,432,000
Total Data Set 2,751,000 57,771,000

Based on these assumptions, there will be nearly 58 million files, within the available data set. The vast majority of this data will be duplicates. As all of the email captured on the backup tapes (in this scenario) also remains[1] on the live email account, the entire backup set can be filtered out (for calculation purposes), resulting in around 3 million files.

Keyword searching this data, culling at 66%, would leave just over 1 million files.

Further de-duplication across the data set, e.g. the same email between multiple people, would remove 30%, resulting in just over 650,000 files.

The original data set of 58 million documents has been culled to 650,000 documents; this is a 98.9% cull.

Despite the huge cull of data, the cost of an initial review of these documents is expected to be $1,300,000, this is to locate just 5,000 documents.  The costs of hosting and culling the data can be expected to be of a similar order of magnitude.

Errors in Review – Keyword Selection

Despite the huge cost of the review it can be expected that there will be significant errors in the review. In the example above  final cull of data was 99.81%, of this over 2.6 million files were removed either by keywords (chosen by a legal team) or removed during a review (by the legal team).

Keyword choice alone is problematic as highlighted by Mr. Justice Morgan in October 2008 in the DigiCel v Cable and Wireless case (HC07C01917)[2], and by U.S. Magistrate Judge Paul Grimm in Victor Stanley Inc. v. Creative Pipe Inc. 2008 (WL 2221841), who stated[3] that those involved in keyword selection were going where “Angels fear to tread”.

Judge Paul Grimm said that: “Whether search terms or ‘keywords’ will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics…. Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread. This topic is clearly beyond the ken of a layman.”

Errors in Review – Human Error

In addition to the errors by the selection of keywords errors will be conducted by those reviewing the data.

Measures of human error are difficult to quantify; but one known method is the Human Error Assessment and Reduction Tool, HEART, developed by Williams in 1986. This looks at the variables in a scenario and attempts to predict the error rate. Factors include inexperience, risk misperception, conflict of objectives and low morale, virtually all of which are present, to a significant degree, for a junior member of a review team working on large scale project.

Other studies[4] of human risk perception have shown that people are likely to follow others, even if they are wrong and they know the person they are following is wrong. In addition to this people, develop a mind set to back up their errors and continue in that direction, often repeating that error.

All of this means it is likely that humans will conduct errors within a review

Concept Searching – Theory

Concept searching is the application of technology that looks at the content of documents and attempts to group them together. Different concept searching technology uses different mathematical models to conduct this concept searching including, Bayesian theory, Shannon Information theory and Latent Symantec Indexing. Some concept searching tools are pre-programmed with languages others learn the language from each particular project. Despite the different approaches the tools all have the same purpose, to cull down data and increase the efficiency of the review.

Example 2: Concept searching a mail box

Two employees’ mail boxes contain emails relating to “Project Raptor”, online dating, football matches, expenses, and ‘Project Ocean”.  Concept Searching is applied to the data set and, given that projects Raptor and Ocean relate to very different subjects, the emails would be grouped into the 5 different concepts, known as clusters: Football, Expenses, Dating, Ocean and Raptor.

The grouping, or clustering, of the emails will be conducted regardless of keywords, i.e. an email about a football match, which does not contain the word football or match, will still be placed in the football cluster. I.e. an email stating “You still available for the 5-a-side game next week?” would be placed with the other football related emails

Concept Searching – Application

The application of concept searching allows for documents to be culled down rapidly, by focusing in on the documents known to be relevant or removing the clearly irrelevant data

Example 3: Application of Concept Searching

A review team are trying to locate all information relating to Project Raptor. The data sizes are as described in Example 1.

The 3 million documents that are left following the removal of back up data are not keyword searched but simply de-duplicated across the entire data set, removing 30% of the data. This leaves 2 million files.

The remaining data set consists of the following concepts:

Concept Relevant Percentage (from the 2 million files)
Project Raptor 0.25%
Project Ocean 0.25%
Dating 0.5%
Football 5%
Expenses 5%
Junk Mail 25%
Project Whisper 20%
Project Yellow 1.5%
Job Applications/Interviews 5%
Marketing 15%
Company Events 3%
Other Classification 20%

Using these concepts it would be easy to remove/ignore most of the data, leaving only the following concepts:

Concept Relevant Percentage Number of Files
Project Raptor 0.25% 5000
Other Classification 20% 400,000

The 5,000 documents, which are clearly relevant, can then be reviewed in detail, at a cost of $10,000.  These files have been identified without the clumsy approach of keyword searching.

The remaining 400,000 “other classification” documents could then be reviewed in a different manner.

As the 400,000 are unlikely to be relevant to the case, they could either be ignored or a cursory review of them could be conducted. For example, these documents could be keyword searched; reducing the data set to 132,000 documents and a random sample of 30% of these could be reviewed. The cost of reviewing these additional documents would be approximately $87,000.

This methodology of concept searching would reduce the total cost from $1.3 million to $97,000 a reduction in costs of over 90%.

Concept Searching: The Panacea?

In the hypothetical Example 3 there is a perfect break down of concepts, i.e. exactly 5,000 documents were found in the cluster “Project Raptor” [i.e. the exact number of relevant documents as defined in Example 1]. This is clearly unrealistic[5] and the best that could be hoped for is a cluster of a similar magnitude, but more files, e.g. 7,500 files. This would mean that a review team would have more documents to review than required, but far far less than in the original data set, so they are not missing any documents that could be relevant. i.e the concept searching tool should err on the side of caution.

It can be seen that the reduction in cost is massive, but does concept searching really work and is it reasonable?

Concept Searching tools certainly do reduce down volumes of data, but they are certainly not perfect, even a salesman for a concept searching company would not state they are. But are they a reasonable approach to large data volumes or are there better methods to cull the data down?

The Civil Procedure Rules, Part 31, define how a company must review their data with the emphasis on a “reasonable search”.  Section 31.7 of the Civil Procedure Rules states that:

(2)     The factors relevant in deciding the reasonableness of a search include the following –

(a)     the number of documents involved;
(b)     the nature and complexity of the proceedings;
(c)     the ease and expense of retrieval of any particular document; and
(d)     the significance of any document which is likely to be located during the search

This would imply concept searching methodology would be allowed, and reasonable, as part of a review. This is because it can radically reduce costs of a review, which could otherwise dwarf the value of the case. Section 31.7.2(d) of Civil Procedure Rules specifically refers to the significance that a document is likely to be located. Using concept searching increases the probability that a significant document is going to be found in a given group or cluster of documents and therefore reviewing only those clusters would be reasonable. If nobody is suggesting that a receptionist’s emails are reviewed fraud case involving the CEO and COO, why review the documents in a “football” cluster?

The concern is not about the ability of concept searching tools to cull the documents down, but rather the accuracy of this.

Keyword searching has known flaws; the wrong keywords, incorrectly spelt keywords either by the review team or the custodian expected to be using the term. Keyword searching is also blunt and often produces a high volume of false positives and false negatives. But, what a keyword searching tool is doing is clearly understood and the errors are known.

Concept searching, however, is undocumented, the exact formulas are often kept hidden and there is a lack of understanding of the technology. What makes an email move from one concept to another, if it discusses two different concepts? What’s the error rate? If the same set of “relevant” email is stored in different data sets, will the concept searching tool correctly identify the same number of relevant emails?

Most importantly is the using concept searching reasonable?

In the case of Abela v Hammond Suddards in 2008, the judge stated that “…[It is N]ot that no stone must be left unturned” but that a “reasonable search” is conducted.

This author would believes that concept searching is reasonable in many, but not all, scenarios.

Foot Notes

[1] This level of email storage would almost never occur, even if it did it would be difficult to prove and therefore backup tapes should be considered.

[2] https://whereismydata.wordpress.com/2009/01/25/case-law-keywords/

[3] http://www.insidecounsel.com/News/2008/9/Pages/Where-angels-fear-to-tread.aspx

[4] Exact citation needed. Taken from Politics of Risk and Fear by Dan Gardner

[5] Given the current state of technology, though in the future this may become more realistic.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: