Electronic Discovery: Concept Searching & Keyword Searching

Keyword Searching & Concept Searching

In the wake of the DigiCel case, and given the constant strides in concept searching (not to mention the hefty marketing budget spent on advertising the tools), is keyword searching a thing of the past?

The preceding article, which discusses concept searching, used an example of where concept searching would trump keyword searching. The same example is discussed below.

Example 1:

“Do you want to watch football tonight, I bet Chelsea scores?”

Example 2:

“Do you want to watch the game tonight, I bet Chelsea scores?”

If the term “football” was used as a keyword, then Example 1 would be found, but Example 2 would not, even though both are clearly about football. A concept searching tool would be expected to “find” both examples, because “Concept Searching” is, as the name implies, actually looking for “concepts” rather than just keywords.

If a keyword search was to be conducted, then the keyword search list could be expanded to include “game”, but then the following sentence would not be found:

  • Are you going to see the match tonight?

Therefore the word “match” could be added. But what about the sentence:

  • Fancy a game of footy tonight?

There are, of course, numerous combinations in which a football match can be talked about without mentioning the term “football”.

If a keyword search list was going to be used to identify football, then the keyword list would have to be quite comprehensive, and include the following terms.

  • Football
  • Soccer
  • Match
  • Score
  • Game
  • Footy
  • Five a side
  • 5 a side
  • Players
  • League
  • Cup

This list is not guaranteed to bring back every email and document about football, but it is guaranteed to bring back a lot of non-relevant material. It is immediately obvious that the terms “cup” and “match” could have lots of meanings other than football. If the document set is a couple of hundred thousand documents, or even a couple of million, then the number of false positives produced will be huge. In this case the keyword search list would not be very effective.
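The trade-off described above can be sketched in a few lines of Python. This is only an illustration, assuming simple case-insensitive whole-word matching; the function name `keyword_hits` and the fourth message (a non-football email containing “cup”) are hypothetical and not taken from any real review tool or data set.

```python
import re

def keyword_hits(text, keywords):
    """Return the keywords that appear in text as whole words (case-insensitive)."""
    hits = []
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(kw)
    return hits

messages = [
    "Do you want to watch football tonight, I bet Chelsea scores?",
    "Do you want to watch the game tonight, I bet Chelsea scores?",
    "Fancy a game of footy tonight?",
    "Can you bring a cup of coffee to the board meeting?",  # hypothetical non-football email
]

narrow = ["football"]
broad = ["football", "soccer", "match", "score", "game", "footy", "cup"]

for msg in messages:
    # The narrow list misses football emails; the broad list also hits the coffee email.
    print(keyword_hits(msg, narrow), keyword_hits(msg, broad), msg)
```

With the narrow list only the first message is found; with the broad list every football message is found, but so is the email about coffee.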

Concept searching  could cluster the documents of a similar nature together and therefore help to find the football related emails.

In this trivial example we can see the failings of keyword searching and the benefits of concept searching, but that does not mean that concept searching has replaced keyword searching; far from it.

The use of keyword searching in a concept based world

Even if concept searching is used, reviewers will still need to find documents within clusters.  Going back to the football example, once the documents have been clustered, a keyword search for “football” could help identify the relevant cluster.

Equally, it may be that there are so many documents to start with that an initial cull of documents with a very broad keyword search could be justified before going into a concept based review. Taking a document set from 10 million to 1 million documents could be seen as an economical way to approach a review. The reverse is also true: a concept search could be conducted and then a keyword search applied, to help focus in on the core documents quickly.

The use of keywords will not only depend on the case, but the tools and the service being used.  Some companies will apply keywords, and then only those documents responsive to the keyword search will ever be available for review. Other companies/services will apply the keyword search, after the data was loaded into a review platform. These two different service offerings can make a huge difference to the legal strategy, the results, and the costs.

The next article on keywords will discuss how and why.


Electronic Discovery: What is Concept Searching?

There is much discussion about concept searching, its benefits, etc (and there are  quite a few articles on this site about the very subject). But what is it, and how is it deployed?

What is concept searching?

Concept searching is a method of searching files not based on keywords, but on the subject matter of the document, paragraph, or sentence.  This is different to keyword searching which requires an exact keyword hit.

Example 1:

“Do you want to watch football tonight, I bet Chelsea scores?”

Example 2:

“Do you want to watch the game tonight, I bet Chelsea scores?”

If the term “football” was used as a keyword, then Example 1 would be found, but Example 2 would not.  But they are both clearly about football. However, a concept searching tool would be expected to “find” both examples, because “Concept Searching” is, as the name implies, actually looking for “concepts” rather than just keywords.

How is Concept Searching Deployed?

Concept Searching can be deployed in a variety of methods, and with many different names, depending on the vendor/service provider/consultant. We will attempt to cover the most common names, and methodologies, below.


Clustering is, perhaps, the most commonly known use of concept searching. It takes a set of data and then breaks it into groups of “similar documents”, e.g. one group could be documents relating to football, another documents relating to meetings, etc. The number and size of the groups would depend on the documents and the concept searching tools being used. For example, 10,000 documents could be broken into 10 groups of 1,000 documents each, or 1,000 groups of 10 documents each. Equally, there could be 1 group of 5,000 and 5 groups of 1,000. There are a near infinite number of combinations. The more advanced tools on the market allow the operator a degree of manipulation over the sizing and nature of the groups.

The advantage of these groupings is that they allow the effective focusing of resources, e.g. the groupings that appear to be about football, parties, and other junk material can either be left unreviewed or just scanned quickly. If there are 500 documents in one group, and a sample review of that group shows that they all relate to fantasy football, and all the email titles appear to relate to fantasy football, then it may (case dependent) be reasonable to skip the rest of the group. Equally, if there is a group of 1,000 documents related to “contracts”, the senior reviewers can be dedicated to that cluster and a detailed review can be conducted early on in the case, rather than the documents going through one or two reviewers before they get reviewed in detail.
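The grouping idea can be sketched as follows. This is a deliberately simple stand-in, assuming plain word-overlap (cosine similarity over word counts) and a single-pass grouping; commercial concept searching tools use richer models, and the function names, sample documents, and `threshold` parameter here are all hypothetical.

```python
import math
from collections import Counter

def vectorise(text):
    """Turn a document into a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.5):
    """Assign each document to the first group whose first member it resembles."""
    vectors = [vectorise(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # nothing similar enough: start a new group
    return clusters

docs = [
    "fantasy football league scores this week",
    "fantasy football transfer deadline this week",
    "draft contract terms for the supplier",
    "signed contract terms from the supplier",
]

print(cluster(docs))                 # two groups: football and contracts
print(cluster(docs, threshold=0.9))  # a stricter threshold gives more, smaller groups
```

Raising or lowering the threshold is one way an operator could influence the number and size of the groups.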

Auto-Tagging/Predictive Marking

This methodology works on the same technology, identification of documents through concepts, but rather than creating multiple groups it will create a couple of groups, or possibly only one. Generally a small sample of documents which are similar to each other is provided, to search against a very large number of unknown documents.  The concept searching tool will search the large number of unknown documents for documents which are “similar” to the known document set. Documents that are found to be similar will then be identified and clustered together. This type of technology can be deployed in several different ways, e.g.

  • Looking for documents which are similar to a document already disclosed
  • Looking for documents that are similar to documents that are known to be junk, e.g. if there is a lot of social networking email traffic, this could be used to identify much of the spam and remove it from the review set
  • If  “hot documents” are found during the initial review, these can be used to identify other similar documents.
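The “find more like these” approach in the list above can be sketched as below. Again this is only an illustration under stated assumptions: word-overlap similarity stands in for a real concept model, and `find_similar`, the seed emails, and the 0.4 threshold are hypothetical choices rather than any vendor's method.

```python
import math
from collections import Counter

def vectorise(text):
    """Turn a document into a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_similar(seed_docs, unknown_docs, threshold=0.4):
    """Return indices of unknown documents that resemble the combined seed set."""
    seed = Counter()
    for doc in seed_docs:
        seed.update(vectorise(doc))
    return [i for i, doc in enumerate(unknown_docs)
            if cosine(seed, vectorise(doc)) >= threshold]

# Known junk: social networking notifications (hypothetical sample)
seed = [
    "your friend tagged you in a photo",
    "your friend sent you a friend request",
]
unknown = [
    "your friend tagged you in a new photo album",
    "quarterly revenue forecast attached for review",
    "a friend sent you a message",
]

print(find_similar(seed, unknown))  # indices of emails that look like the junk sample
```

The same function could be pointed at a seed set of “hot documents” instead of junk; only the use made of the results changes.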

Concept  Searching “words”

Some concept searching tools allow the searching of words or paragraphs. In these circumstances the tool is doing just what is done above, but on a more focused paragraph or sentence rather than an entire document. This is particularly important when dealing with large documents; if there is a 500 page document that is  important or “hot”, because of one paragraph, concept searching for similar documents will often produce junk. In these cases concept searching for a key paragraph may be more effective.

Keyword Searching, is this still needed?  With the advent of concept searching, should keyword searching still be conducted? This subject will be covered in the next article on concept searching.

Electronic Discovery: Francis Bacon and Concept Searching

Human errors and the human state of mind have a big effect on the decision making process in electronic discovery, but what are these errors?

A few hundred years ago Francis Bacon stated: “The human understanding when it has once adopted an opinion (either as being the received opinion or as being agreeable to itself) draws all things else to support and agree with it. And though there be a greater number and weight of instances to be found on the other side, yet these it either neglects and despises, or else by some distinction sets aside and rejects; in order that by this great and pernicious predetermination the authority of its former conclusions may remain inviolate.”

What Francis Bacon was saying, rather elegantly, is that people stick to their guns. What he believed to be true in the 17th century, psychologists in the 20th and 21st centuries have now shown to be true.

Humans Examining Evidence

It has been demonstrated by psychologists that people often not only stick to their beliefs but seek out evidence to reinforce their own opinions and reject new evidence which is contrary to their belief. Test after test has shown this, and it can be seen in real life situations, from politicians to generals. In the run-up to Pearl Harbor there were numerous warnings that an attack was about to occur, including a Japanese submarine that was sunk just outside the harbor only 1 hour before the attack.  But the admiral in charge, Admiral Kimmel, believed that Japan would not attack Pearl Harbor, and so ignored or deliberately misinterpreted information, intelligence, and warnings – he stuck to his guns[1]. He did not even cancel weekend leave for his staff, or ask if the Army were manning the anti-aircraft guns (this would have only required a single phone call). This is not uncommon, and people at all levels do this quite regularly. This state of mind affects scientist and politician alike.

Humans making decisions

Not only do humans often stick to incorrect decisions, but we are easily influenced. For example people will often follow something known as the “rule of primacy”. This says, in short, that the first thing people learn about a subject, they take to be true.

Example: If a person is told that Car A is fantastic by a friend of theirs, then they will tend to believe that, even if they are later told that Car A is less than good. In fact they will seek out evidence to support what they already believe.

Another well known cause of errors in humans is the availability error. This means that the stronger and more powerful the memory, the more likely people are to make decisions based on that information. This has been shown in labs and the real world. For example, purchases of earthquake insurance in areas that have earthquakes increase immediately after a quake but decrease the longer it has been since an earthquake occurred – because the memory of the quake fades. However, the probability of a quake increases the longer the time since the last one, and the area is at its safest immediately after a quake. I.e. people are buying and not buying insurance at exactly the wrong time. Equally, if people are asked to estimate which is the more common, words beginning[2] with the letter “R” or words having “r” as the third letter, they will often say the former, as they can immediately think of words beginning with R: rain, rainbow, rivet, red, real, reality, etc. But in fact there are more words with “r” as the third letter: street, care, caring, borrow, etc. However, people have the first letter “R” strongest in their mind, so that is what they believe.

Other well known causes of human errors include:

Peer pressure/conformity: People tend to follow the decisions of others, even when they are quite obviously wrong. There are well known tests of this, such as a person being put in a room with 5 or 10 other people and asked to complete simple tests such as saying which is the shortest of three lines, or how many beats of a drum there were. The tests are simple; the beats would be easy to count or one of the lines would be obviously shorter, but the other 5 or 10 people in the room were actors, paid to deliberately give the same wrong answer. If the answers of everybody were read out aloud, the test subject would, more often than not, follow the incorrect answers.

Obedience/Management pressure: Following the opinion of a superior (depending on the culture), regardless of whether it is right, is something that can often occur. This was most famously demonstrated in the Stanley Milgram tests, where volunteers willingly applied what they believed to be enough voltage to kill other innocent people, simply because they were asked to.

There are many more examples proving these human conditions, and even more conditions that cause us to make errors on a day to day basis – it’s just the nature of the human brain. It is how we work (or don’t).

Electronic Discovery & Psychology

So what has all of this psychology and “soft science” got to do with electronic discovery?

Electronic discovery is historically driven by humans, from keyword selection to the relevance of a document.  It is the errors identified above, and more, that can come into play during a review.

Below are examples of how these well known errors can affect a review:

  • Once a person decides on keyword search criteria, once they have put their flag in the ground, they are, statistically, unlikely to be willing to change their mind about the value of the search criteria. They may even ignore evidence or documents that could prove otherwise. In fact research has shown that once a person states a decision publicly, or commits it to writing, they are even more likely to stick to their guns than somebody who makes that decision privately.
  • If a person has to review another 500 page document, and it’s late and they want to go home, then they may quickly start to believe that this document is not relevant.  They may start to scan the document looking for information that demonstrates that the document is not relevant, rather than looking for evidence that shows it is relevant.
  • A second opinion on a document’s relevance may be sought, from a senior manager or colleague, and that opinion will then be followed through the review, regardless of whether it was right or not, even if the original reviewer believes the opinion to be wrong.
  • If a review platform does not record who made the decision on whether a document is relevant or not, then the reviewer may be less inclined to be diligent, as they are removed from their responsibility by anonymity. [Anonymity has also been shown to be an influencing factor in people’s behavior and choices].


There are methods to try to reduce the number of traps the human brain walks into. Simply being aware of the problems and making an effort to avoid them is one solution, e.g. consciously giving as much weight to the first piece of evidence seen as to the last.

However, we are all fallible, and in large scale reviews avoiding these errors is going to be very difficult, and possibly expensive in terms of time.

The most obvious solution is automation through concept searching. As previously discussed concept searching can be of great value during a large document review and, like all systems, it will have errors; but we know it’s not susceptible to the human failings discussed in this article.

It doesn’t matter what the system saw first, how strong or visual a document is, or what other reviewers think.  Concept searching will only apply repeatable, known logic to a group of documents.

[1] As a result of this Admiral Kimmel was later demoted

[2] This example is lifted directly from the book Irrationality, by Stuart Sutherland

Electronic Discovery: Concept Searching and Languages

Does concept searching handle foreign languages?

It’s an important question for those operating in an international environment. Most tools in the electronic discovery industry are driven by the US market and so, understandably, are US centric. For concept searching tools this means that they tend to strongly favour the English language.

But what if you’re dealing with a case in Switzerland or Mexico? Will concept searching really work for documents in Spanish?

It will depend on the tool and the operator.

Concept searching tools are dependent on how words relate to each other rather than on specific keywords.

Example:  How the term “house” relates to “buy”, “pricing”, “client”, or “sale” within a single document could well be more important than the appearance of the terms “estate agent” or “realtor” when identifying documents relating to house sales.

Relating documents in this way is hard enough, but doing this in multiple languages is even harder.

Some tools, such as Attenex or Recommind, are “pre-programmed” with languages. They “understand” what the terms “house” and “sale” mean.  So if both words are in the same document, with a certain frequency, the document will be put with other similar documents. “Similar” documents could be files that include the words “apartment” and “sale” or “housing sales”.

This is impressive enough to do in any one language; to do it with multiple languages is even harder. For tools like Attenex this is achieved through brute force, adding more and more languages to the application.

However, tools such as ContentAnalyst take an entirely different approach. They are not programmed with a language; instead they start each project completely blank. They need to be taught a language from a data set. This is done on the fly by the users of ContentAnalyst.

This means that as the operator starts loading in data, the tool begins to understand the particular data set. Once the loading has been completed ContentAnalyst will have built up a picture of the data, after which concept searching can be conducted. This offers a huge advantage over other tools, as it is completely language independent.

The documents could be in English, Swahili or Elvish, it would make no difference to the tool. The tool looks at how words relate to each other, rather than the actual words themselves.

This technology is not only language independent, but industry independent. For example the term “spam” in an IT company means something very different to the same term in a meat company. As ContentAnalyst learns the language for each project, it will pick up on the differences in the use of language, as well as the languages themselves.
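The idea of learning word relationships from the data itself, rather than from a built-in dictionary, can be illustrated with a toy co-occurrence count. This is not how ContentAnalyst or any other named product works internally; it is a minimal sketch, and `learn_associations` and the two tiny corpora are hypothetical.

```python
from collections import Counter, defaultdict

def learn_associations(documents):
    """Count how often each pair of words shares a document.

    No dictionary or language knowledge is used: words are just tokens,
    so the same code applies to any language or industry vocabulary.
    """
    assoc = defaultdict(Counter)
    for doc in documents:
        words = set(doc.lower().split())
        for word in words:
            for other in words:
                if other != word:
                    assoc[word][other] += 1
    return assoc

# Two hypothetical projects in which "spam" means different things
it_company = ["spam filter blocked email", "update spam filter rules", "email server rules"]
meat_company = ["spam tin label design", "ship spam tin pallets", "label design review"]

it_model = learn_associations(it_company)
meat_model = learn_associations(meat_company)

print(it_model["spam"]["filter"], it_model["spam"]["tin"])      # "spam" sits with "filter"
print(meat_model["spam"]["tin"], meat_model["spam"]["filter"])  # "spam" sits with "tin"
```

Because the associations are rebuilt for each data set, the same word ends up with different neighbours in each project.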

Electronic Discovery: Concept Searching and Error Rates

Assessing the risks and benefits of using conceptual searching technology to cull data, compared with traditional methods of culling data.

As data sizes within companies increase, so does the number of documents available to review. As document sets are now regularly in the hundreds of thousands, or even millions, reviewing all of these documents is no longer possible; therefore culling of the data has to be conducted. Traditionally this culling has been conducted via the application of keyword searching, de-duplication, and document filtering.

However, due to the sheer scale of documents in modern corporations, traditional methods of culling will not always reduce the data to an affordable number of documents. Therefore vendors are increasingly offering new technologies, concept searching and near de-duping, to reduce the data volumes even further. These technologies allow review teams to radically cull documents and complete reviews in time scales that were simply not possible before, but huge volumes of documents are not being reviewed, even though they were collected.

What is the risk of the documents not being reviewed and being relevant? Are these tools proportional?


In this working  example data is to be collected, over a period of 1 year, from 21 people. Only email and personal computers are to be collected.

Example 1

  • There are 21 people to be investigated. Each person has a laptop and an unlimited mailbox.
  • A person receives 600 emails per working day, with 20 working days a month, 12 months a year (144,000 emails per year).
  • There is an average of 15,000 files per PC, per person.
  • There are 18 backup tapes per year. 5 daily, 12 monthly, and 1 yearly back up.
  • Users do not delete their email, and therefore the backups contain an exact copy of the live email.
  • 30% of emails sent are between parties within the company and so are duplicates.
  • Keyword filtering reduces the data set by 66%.
  • A member of a review team bills $1000 a day and can review 500 documents a day.
  • There are 5,000 documents and emails, in total, relating to the subject matter.

Data Volume Calculations

Media Source | Per Person | Entire Data Set
Laptop | 15,000 | 315,000
Email Per Year | 144,000 | 3,024,000
Backup Tapes | 2,592,000 | 54,432,000
Total Data Set | 2,751,000 | 57,771,000
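The table can be reproduced with a few lines of arithmetic. This is a sketch only: the constant names are illustrative, and the 144,000 emails per person per year is taken directly from the table rather than derived here.

```python
PEOPLE = 21
FILES_PER_LAPTOP = 15_000
EMAILS_PER_PERSON_PER_YEAR = 144_000  # figure used in the table
BACKUP_TAPES_PER_YEAR = 18            # each tape holds a full copy of the mailbox

per_person = {
    "Laptop": FILES_PER_LAPTOP,
    "Email Per Year": EMAILS_PER_PERSON_PER_YEAR,
    "Backup Tapes": EMAILS_PER_PERSON_PER_YEAR * BACKUP_TAPES_PER_YEAR,
}
per_person["Total Data Set"] = sum(per_person.values())

for source, count in per_person.items():
    # Per-person figure, then the figure for all 21 custodians
    print(f"{source:<15} {count:>10,} {count * PEOPLE:>12,}")
```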

Based on these assumptions, there will be nearly 58 million files, within the available data set. The vast majority of this data will be duplicates. As all of the email captured on the backup tapes (in this scenario) also remains[1] on the live email account, the entire backup set can be filtered out (for calculation purposes), resulting in around 3 million files.

Keyword searching this data, culling at 66%, would leave just over 1 million files.

Further de-duplication across the data set, e.g. the same email between multiple people, would remove 30%, resulting in just over 650,000 files.

The original data set of 58 million documents has been culled to 650,000 documents; this is a 98.9% cull.

Despite the huge cull of data, the cost of an initial review of these documents is expected to be $1,300,000, and this is to locate just 5,000 documents.  The costs of hosting and culling the data can be expected to be of a similar order of magnitude.
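The review cost and cull percentage follow directly from the assumptions in Example 1. A minimal sketch, using the document counts stated above as inputs (the function and variable names are illustrative):

```python
DAILY_RATE = 1_000    # dollars billed per reviewer per day (Example 1)
DOCS_PER_DAY = 500    # documents one reviewer can review per day (Example 1)

def review_cost(documents):
    """Cost in dollars of a linear review of the given number of documents."""
    reviewer_days = documents / DOCS_PER_DAY
    return reviewer_days * DAILY_RATE

original_set = 57_771_000   # total files before any culling
review_set = 650_000        # files left after culling, as stated above
relevant = 5_000            # documents actually relating to the subject matter

cost = review_cost(review_set)
cull_percentage = (1 - review_set / original_set) * 100

print(f"Review cost: ${cost:,.0f}")
print(f"Cull: {cull_percentage:.1f}%")
print(f"Cost per relevant document: ${cost / relevant:,.0f}")
```

On these figures, each relevant document costs $260 of review time to locate.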

Errors in Review – Keyword Selection

Despite the huge cost of the review, it can be expected that there will be significant errors in it. In the example above the final cull of data was 99.81%; of this, over 2.6 million files were removed either by keywords (chosen by a legal team) or during a review (by the legal team).

Keyword choice alone is problematic as highlighted by Mr. Justice Morgan in October 2008 in the DigiCel v Cable and Wireless case (HC07C01917)[2], and by U.S. Magistrate Judge Paul Grimm in Victor Stanley Inc. v. Creative Pipe Inc. 2008 (WL 2221841), who stated[3] that those involved in keyword selection were going where “Angels fear to tread”.

Judge Paul Grimm said that: “Whether search terms or ‘keywords’ will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics…. Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread. This topic is clearly beyond the ken of a layman.”

Errors in Review – Human Error

In addition to the errors introduced by the selection of keywords, errors will be made by those reviewing the data.

Measures of human error are difficult to quantify, but one known method is the Human Error Assessment and Reduction Technique, HEART, developed by Williams in 1986. This looks at the variables in a scenario and attempts to predict the error rate. Factors include inexperience, risk misperception, conflict of objectives and low morale, virtually all of which are present, to a significant degree, for a junior member of a review team working on a large scale project.

Other studies[4] of human risk perception have shown that people are likely to follow others, even if they are wrong and they know the person they are following is wrong. In addition to this, people develop a mind set to back up their errors and continue in that direction, often repeating the error.

All of this means it is likely that humans will make errors within a review.

Concept Searching – Theory

Concept searching is the application of technology that looks at the content of documents and attempts to group them together. Different concept searching technologies use different mathematical models to conduct this concept searching, including Bayesian theory, Shannon Information theory and Latent Semantic Indexing. Some concept searching tools are pre-programmed with languages; others learn the language from each particular project. Despite the different approaches the tools all have the same purpose: to cull down data and increase the efficiency of the review.

Example 2: Concept searching a mail box

Two employees’ mail boxes contain emails relating to “Project Raptor”, online dating, football matches, expenses, and “Project Ocean”.  Concept searching is applied to the data set and, given that projects Raptor and Ocean relate to very different subjects, the emails would be grouped into 5 different concepts, known as clusters: Football, Expenses, Dating, Ocean and Raptor.

The grouping, or clustering, of the emails will be conducted regardless of keywords, i.e. an email about a football match which does not contain the word football or match will still be placed in the football cluster. For example, an email stating “You still available for the 5-a-side game next week?” would be placed with the other football related emails.

Concept Searching – Application

The application of concept searching allows for documents to be culled down rapidly, by focusing in on the documents known to be relevant or removing the clearly irrelevant data.

Example 3: Application of Concept Searching

A review team are trying to locate all information relating to Project Raptor. The data sizes are as described in Example 1.

The 3 million documents that are left following the removal of back up data are not keyword searched but simply de-duplicated across the entire data set, removing 30% of the data. This leaves 2 million files.

The remaining data set consists of the following concepts:

Concept | Relevant Percentage (from the 2 million files)
Project Raptor | 0.25%
Project Ocean | 0.25%
Dating | 0.5%
Football | 5%
Expenses | 5%
Junk Mail | 25%
Project Whisper | 20%
Project Yellow | 1.5%
Job Applications/Interviews | 5%
Marketing | 15%
Company Events | 3%
Other Classification | 20%

Using these concepts it would be easy to remove/ignore most of the data, leaving only the following concepts:

Concept | Relevant Percentage | Number of Files
Project Raptor | 0.25% | 5,000
Other Classification | 20% | 400,000

The 5,000 documents, which are clearly relevant, can then be reviewed in detail, at a cost of $10,000.  These files have been identified without the clumsy approach of keyword searching.

The remaining 400,000 “other classification” documents could then be reviewed in a different manner.

As the 400,000 are unlikely to be relevant to the case, they could either be ignored or a cursory review of them could be conducted. For example, these documents could be keyword searched, reducing the data set to 132,000 documents, and a random sample of 30% of these could be reviewed. The cost of reviewing these additional documents would be approximately $87,000.

This methodology of concept searching would reduce the total cost from $1.3 million to $97,000, a reduction in costs of over 90%.
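The comparison can be checked with a short calculation. This sketch takes the cluster percentages, the $1.3 million keyword-based cost, and the approximate $87,000 cursory review cost as stated in the text; only the detailed review cost and the overall reduction are computed, and the names are illustrative.

```python
DAILY_RATE = 1_000    # dollars billed per reviewer per day (Example 1)
DOCS_PER_DAY = 500    # documents one reviewer can review per day (Example 1)
TOTAL_FILES = 2_000_000

def review_cost(documents):
    """Cost in dollars of reviewing the given number of documents."""
    return documents / DOCS_PER_DAY * DAILY_RATE

# Percentages of the 2 million files, from the table in Example 3
cluster_share = {"Project Raptor": 0.25, "Other Classification": 20.0}
files = {name: round(TOTAL_FILES * pct / 100) for name, pct in cluster_share.items()}

detailed_review = review_cost(files["Project Raptor"])
cursory_review = 87_000        # approximate figure stated in the text
keyword_based_cost = 1_300_000  # cost of the keyword-based review in Example 1

concept_based_cost = detailed_review + cursory_review
reduction = 1 - concept_based_cost / keyword_based_cost

print(files)
print(f"Concept-based cost: ${concept_based_cost:,.0f}")
print(f"Reduction: {reduction:.1%}")
```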

Concept Searching: The Panacea?

In the hypothetical Example 3 there is a perfect break down of concepts, i.e. exactly 5,000 documents were found in the cluster “Project Raptor” [i.e. the exact number of relevant documents as defined in Example 1]. This is clearly unrealistic[5] and the best that could be hoped for is a cluster of a similar magnitude, but with more files, e.g. 7,500 files. This would mean that a review team would have more documents to review than required, but far fewer than in the original data set, and they would not be missing any documents that could be relevant, i.e. the concept searching tool should err on the side of caution.

It can be seen that the reduction in cost is massive, but does concept searching really work and is it reasonable?

Concept searching tools certainly do reduce volumes of data, but they are certainly not perfect; even a salesman for a concept searching company would not state they are. But are they a reasonable approach to large data volumes, or are there better methods to cull the data down?

The Civil Procedure Rules, Part 31, define how a company must review their data with the emphasis on a “reasonable search”.  Section 31.7 of the Civil Procedure Rules states that:

(2)     The factors relevant in deciding the reasonableness of a search include the following –

(a)     the number of documents involved;
(b)     the nature and complexity of the proceedings;
(c)     the ease and expense of retrieval of any particular document; and
(d)     the significance of any document which is likely to be located during the search

This would imply that a concept searching methodology would be allowed, and reasonable, as part of a review. This is because it can radically reduce the costs of a review, which could otherwise dwarf the value of the case. Section 31.7.2(d) of the Civil Procedure Rules specifically refers to the significance of any document which is likely to be located. Using concept searching increases the probability that a significant document is going to be found in a given group or cluster of documents, and therefore reviewing only those clusters would be reasonable. If nobody is suggesting that a receptionist’s emails are reviewed in a fraud case involving the CEO and COO, why review the documents in a “football” cluster?

The concern is not about the ability of concept searching tools to cull the documents down, but rather the accuracy of this.

Keyword searching has known flaws; the wrong keywords, incorrectly spelt keywords either by the review team or the custodian expected to be using the term. Keyword searching is also blunt and often produces a high volume of false positives and false negatives. But, what a keyword searching tool is doing is clearly understood and the errors are known.

Concept searching, however, is undocumented; the exact formulas are often kept hidden and there is a lack of understanding of the technology. What makes an email move from one concept to another, if it discusses two different concepts? What’s the error rate? If the same set of “relevant” email is stored in different data sets, will the concept searching tool correctly identify the same number of relevant emails?

Most importantly, is using concept searching reasonable?

In the case of Abela v Hammond Suddards in 2008, the judge stated that “…[It is N]ot that no stone must be left unturned” but that a “reasonable search” is conducted.

This author believes that concept searching is reasonable in many, but not all, scenarios.

Foot Notes

[1] This level of email storage would almost never occur; even if it did, it would be difficult to prove, and therefore backup tapes should be considered.

[2] https://whereismydata.wordpress.com/2009/01/25/case-law-keywords/

[3] http://www.insidecounsel.com/News/2008/9/Pages/Where-angels-fear-to-tread.aspx

[4] Exact citation needed. Taken from Politics of Risk and Fear by Dan Gardner

[5] Given the current state of technology, though in the future this may become more realistic.

Electronic Discovery: Is Concept Searching “a reasonable search”?

It’s a small sentence, but a big question: “Is concept searching a reasonable search?”

In the UK the process of electronic discovery is defined by the Civil Procedure Rules, Part 31.  31.7 defines the duty of search and states:

1)     When giving standard disclosure, a party is required to make a reasonable search for documents falling within rule

Rule 31.7 then goes on to state:

(2)     The factors relevant in deciding the reasonableness of a search include the following –

(a)     the number of documents involved;
(b)     the nature and complexity of the proceedings;
(c)     the ease and expense of retrieval of any particular document; and
(d)     the significance of any document which is likely to be located during the search.

Neither “reasonable” nor “reasonableness” is defined, either in the Civil Procedure Rules or in UK law in general. It is quite literally based on what is reasonable. A guide for this “reasonableness” test is often built upon case law. For example, in UK criminal law there is a wealth of case law on what is “reasonable force” to defend yourself.

But there is only one case where the issue of reasonable search in electronic discovery has been addressed in the UK, the DigiCel case, and this did not relate to concept searching. So what is reasonable?

Firstly, let us address two extreme cases and use these to calibrate our own judgement.

Example 1:  Low number of documents.

A single PST file, with 2000 emails, from a single custodian is provided for review. The case is valued at £10 million.

Is it reasonable to cull this data down using concept searching? This author would suggest not, as that volume of documents could be approached rapidly through a linear review. Concept searching could be used to enhance the review, but not to cull the data (though with such a low number of documents there would be unlikely to be any real benefit).

Example 2: An extremely high number of documents

A file server has 100 million documents on it. These 100 million documents are, in general, shared documents within a massive company. 10 custodians are being investigated and they could have saved their data anywhere in this data set; therefore it has been decided to review this data set for relevant documents. The case is valued at £10 million.

Is it reasonable to search and cull this data set with concept searching? This author would say it is, and this is the reasoning:

100 million documents is a LOT to read, and would cost millions to review, let alone process. In fact the cost of reading all of the documents, including the processing, could quickly come to a sum approaching the £10 million value of the case. That is, it is probably not economically feasible to review the data. This alone would make the review of all of these documents unreasonable.

Is there any other metric we can use to try and measure the reasonableness of this review? We could use a bit of maths.

Firstly we need to estimate how many documents we could find from the custodians; we are going to use high numbers as this will err on the side of caution (why will become clear shortly).

We will assume a custodian creates 10 documents a day for every work day (200 days a year). We will further assume that the period of relevance for the particular incident is 5 years. This means that there is a maximum of 200*10*5 documents per custodian, or 10,000 documents per custodian, which means that there will be 100,000 documents created by the 10 custodians in total. We will further assume that 100% of these documents are relevant (a highly over-optimistic assumption).

Therefore, out of the 100 million documents, 100,000 documents will be relevant, so only 1 in 1,000 is likely to be relevant. If the expected set of relevant documents was a maximum of 10,000 documents, then the probability of any one document being relevant would be 1 in 10,000. That is a very low number. In reality it’s highly unlikely that 10 relevant documents are going to be produced a day, so the total number of “relevant documents” could be as low as a few thousand. This could bring the probability to 1:50,000 (which corresponds to 2,000 relevant documents); a very low number.
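The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate described in the text: the input figures are the ones used in the example, the helper name `one_in_n` is made up for illustration, and the 2,000-document figure is an assumption about what “a few thousand” means (it is the value that gives 1:50,000).

```python
# Figures taken from the worked example in the text.
TOTAL_DOCUMENTS = 100_000_000   # documents on the file server
CUSTODIANS = 10
DOCS_PER_DAY = 10               # deliberately high estimate
WORK_DAYS_PER_YEAR = 200
YEARS_OF_RELEVANCE = 5


def one_in_n(relevant: int, total: int = TOTAL_DOCUMENTS) -> int:
    """Return N such that roughly 1 document in N is relevant."""
    return total // relevant


docs_per_custodian = DOCS_PER_DAY * WORK_DAYS_PER_YEAR * YEARS_OF_RELEVANCE
docs_all_custodians = docs_per_custodian * CUSTODIANS

print(f"Per custodian: {docs_per_custodian:,}")
print(f"All custodians: {docs_all_custodians:,}")
print(f"Upper bound: 1 in {one_in_n(docs_all_custodians):,}")
print(f"10,000 relevant: 1 in {one_in_n(10_000):,}")
# Assumption: "a few thousand" is read here as 2,000 relevant documents.
print(f"2,000 relevant: 1 in {one_in_n(2_000):,}")
```

Changing the constants at the top shows how sensitive the ratio is to the assumptions: halving the documents per day, for instance, halves the expected number of relevant documents and doubles N.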

This author would state that a 1:1,000 or 1:10,000 chance of finding a relevant document is such a low probability that it would be unreasonable to review all of those documents.

Therefore culling methods have to be used, which would lead to concept searching. To put it another way, if it’s unreasonable to review the data via a linear method, the only reasonable option is to review by non-linear methods.

These calculations, combined with the pressing issue of costs, would lead us to conclude that it is not only reasonable to use concept searching tools, but required, in this extreme case.

Middle Ground

But what about the middle ground? What about 10 million documents, with 20 custodians? Calculations should be done on the cost of linear review versus concept searching.
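As a rough illustration of such a cost calculation, the sketch below compares reading every document with culling first and reading only what is left. Only the 10 million document count comes from the scenario above; the cost per document, the size of the culled set and the tool cost are hypothetical placeholders, not real prices, and should be replaced with figures from the case in hand.

```python
# Rough cost comparison: linear review versus culling first, then reviewing.
# Every monetary figure and the cull size below are hypothetical placeholders.

TOTAL_DOCUMENTS = 10_000_000      # from the middle-ground scenario in the text
COST_PER_DOC_REVIEWED = 0.50      # hypothetical: pounds per document read
DOCS_LEFT_AFTER_CULL = 500_000    # hypothetical: documents surviving the cull
CULLING_TOOL_COST = 100_000       # hypothetical: pounds for licence and processing


def linear_review_cost(total_docs: int, cost_per_doc: float) -> float:
    """Cost of having reviewers read every document."""
    return total_docs * cost_per_doc


def culled_review_cost(docs_after_cull: int, cost_per_doc: float, tool_cost: float) -> float:
    """Cost of the culling tool plus reading only what survives the cull."""
    return tool_cost + docs_after_cull * cost_per_doc


linear = linear_review_cost(TOTAL_DOCUMENTS, COST_PER_DOC_REVIEWED)
culled = culled_review_cost(DOCS_LEFT_AFTER_CULL, COST_PER_DOC_REVIEWED, CULLING_TOOL_COST)

print(f"Linear review: GBP {linear:,.0f}")
print(f"Cull then review: GBP {culled:,.0f}")
print(f"Difference: GBP {linear - culled:,.0f}")
```

The comparison ignores the cost of relevant documents that the cull misses, which is exactly the risk the reasonableness question is about, so it should be read as one input to the decision rather than the answer.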

What are the expected costs involved and what are the expected benefits? Sampling of data may give an indication of the volume of relevant versus non-relevant documents. If there is a very small number of relevant documents in a mass of irrelevant documents, then concept searching has to be seriously considered to cull the documents down.
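One way to turn a sample into an indication of volume is sketched below. It assumes the sample was drawn at random from the whole data set and reviewed by hand; the sample size and the number of relevant documents found are hypothetical. The Wilson score interval is used here as one common way to put a range around a small proportion; it is a choice made for this sketch, not a method prescribed by the rules or the case law.

```python
import math


def estimate_prevalence(relevant_in_sample: int, sample_size: int, z: float = 1.96):
    """Return (point estimate, lower bound, upper bound) for the proportion
    of relevant documents, using the Wilson score interval.
    z = 1.96 corresponds to roughly 95% confidence."""
    if sample_size <= 0 or not 0 <= relevant_in_sample <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= relevant_in_sample <= sample_size")
    p = relevant_in_sample / sample_size
    denom = 1 + z * z / sample_size
    centre = (p + z * z / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return p, max(0.0, centre - half_width), min(1.0, centre + half_width)


TOTAL_DOCUMENTS = 10_000_000   # from the middle-ground scenario in the text
SAMPLE_SIZE = 1_000            # hypothetical: randomly sampled documents reviewed
RELEVANT_FOUND = 2             # hypothetical: relevant documents in that sample

point, low, high = estimate_prevalence(RELEVANT_FOUND, SAMPLE_SIZE)
print(f"Estimated prevalence: {point:.4%} (range {low:.4%} to {high:.4%})")
print(
    "Projected relevant documents: "
    f"{round(low * TOTAL_DOCUMENTS):,} to {round(high * TOTAL_DOCUMENTS):,}"
)
```

With so few relevant documents in the sample the range is wide, which is itself useful information: it shows how much, or how little, a small sample can say about the full data set.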

These methods of calculation are not going to stand up, by a long way, to a statistician’s enquiry, but they can be used to try to gain an idea of the costs involved and the benefits of concept searching versus linear review.