Electronic Discovery: Using Keywords to Cull?

The previous article, on concept searching and keywords, ended with the following paragraph:

The use of keywords will not only depend on the case, but the tools and the service being used.  Some companies will apply keywords, and then only those documents responsive to the keyword search will ever be available for review. Other companies/services will apply the keyword search, after the data was loaded into a review platform. These two different service offerings can make a huge difference to the legal strategy, the results, and the costs.

Why such a big difference?

The problems are immediately obvious, once the marketing spiel is stripped away.

Option 1: This is one option for reviewing data

  • Data is collected: the “Collected Set”
  • Collected data is culled to create a “Review Set”. Culling could involve keyword searching, date filtering, etc.
  • The Review Set is reviewed, and this produces the “Disclose Set”. Methods to search the data during the review include concept searching, keywords, dates, etc.

This method has a couple of problems, and one benefit. The main problem is that key documents could be missed. No matter how well thought out the keywords used to create the review set are, data will be culled, so it's possible (and probable) that documents of interest are culled out along with the huge mass of irrelevant documents. Keywords change and date ranges shift as the understanding of the case develops. Returning to the example of football (used in the previous two articles), if the keyword “football” is used, but it later turns out that the key custodians use the term “footie” rather than football, what will be done then? Re-searching and processing all of the documents that are not in review?

The benefit of this method (initial culling with keywords) is simply cost, and nothing else.

Most professional vendors will allow data to be moved from the Collected Set into the Review Set as the case changes, but there will often be an additional cost to this, possibly a significant one. There will be time delays, processing and loading fees, etc., and these costs and delays can discourage the movement of data into the review set.

Option 2 – A subtle difference.

The idea is simple: load all of the data into the review platform, everything except for the files that would never be reviewed in that case (e.g. movie files, MP3s, etc.). All of the culling can then be done within the review platform. This way, no decision is made early on about what is relevant or which keywords are to be used. All that is removed is the absolute junk.

Data for review can then be moved into an appropriate folder/bucket/location for review, using whatever means required. This “review set” can then be searched, filtered, concept searched, near de-duped, etc., by the reviewers as normal. I.e. the reviewers will see only the data they need to see, but everything is in the review platform. If/when new terms are identified by the reviewers (dates, words, file types, etc.) these can be applied to the entire data set, and the new documents almost instantly made available for review. This is quick to do, as the data is already in the review platform. If it's quick it's easy, and therefore it should be cheap.
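The difference between the two options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the documents, the `search` helper and the variable names are all hypothetical, and a real review platform would use a proper index rather than a dictionary of strings.

```python
import re

def tokens(text):
    """Lower-case a document and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(documents, keywords):
    """Return the ids of documents containing at least one keyword."""
    wanted = {k.lower() for k in keywords}
    return {doc_id for doc_id, text in documents.items() if tokens(text) & wanted}

# Hypothetical collected set: document id -> extracted text.
collected = {
    1: "Do you want to watch football tonight?",
    2: "Fancy a game of footie after work?",
    3: "Please find the quarterly report attached.",
}

# Option 1: cull before loading. Only the hits ever reach the review platform.
option1_platform = {i: collected[i] for i in search(collected, ["football"])}

# Option 2: load everything, then tag the review set inside the platform.
option2_platform = dict(collected)
option2_review_set = search(option2_platform, ["football"])

# Later the reviewers learn that a key custodian says "footie".
new_terms = ["football", "footie"]
option1_hits = search(option1_platform, new_terms)  # document 2 was never loaded
option2_hits = search(option2_platform, new_terms)  # document 2 is still searchable

print(sorted(option1_hits))
print(sorted(option2_hits))
```

Under Option 1 the "footie" email can only be recovered by going back to the collected set and re-processing; under Option 2 the same search simply widens the review set.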

Vendors can’t process that much data!

The obvious statement against Option 2 is that “Vendors can’t process all of the data for a case, hence the initial cull”. But this is not entirely true, depending on the nature of the cull. For example, if the data is to be culled by a keyword then the data must be processed in the first place. The file must be opened, the text extracted, and then the document searched. Therefore much/all of the processing work has already been done.

There are reasons that this has not been done historically. One of these is TIFFing. Data often needed to be TIFFed prior to loading into a review platform, which is very time consuming and resource intensive. Thankfully, this is no longer required by many review platforms, which removes many of the problems seen previously. Another reason is storage: high quality data storage space was not cheap. Storing 100 GB of data in a data centre was expensive simply because of the hardware involved, but hardware prices have come down (in line with Moore’s law). This may still be a problem for the giant, multi-terabyte cases, but for the 10s or 100s of GB this is less likely to be an issue.

There will be occasions when Option 1 is the better method, or when neither Option 1 nor Option 2 is suitable and an entirely different approach is required. But, generally, it’s worth considering all options and not letting the legal strategy for review be led by a rigid processing/conveyor belt pricing structure.

Really the question is not whether you cull data with keywords, but at what stage that should happen.

Electronic Discovery: Concept Searching & Keyword Searching

Keyword Searching & Concept Searching

In the wake of the DigiCel case, and the constant strides in concept searching (not to mention the hefty marketing budget spent on advertising the tools), does this mean that keyword searching is a thing of the past?

The preceding article, in which concept searching is discussed, used an example of where concept searching would trump keyword searching. The same example is discussed below.

Example 1:

“Do you want to watch football tonight, I bet Chelsea scores?”

Example 2:

“Do you want to watch the game tonight, I bet Chelsea scores?”

If the term “football” was used as a keyword, then Example 1 would be found, but Example 2 would not. But they are both clearly about football. A concept searching tool would be expected to “find” both examples, because “Concept Searching” is, as the name implies, actually looking for “concepts” rather than just keywords.

If a keyword search was to be conducted, then the keyword search list could be expanded to include “game”, but then the following sentence would not be found:

  • Are you going to see the match tonight?

Therefore the word “match” could be added. But what about the sentence:

  • Fancy a game of footy tonight?

There are, of course, numerous combinations in which a football match can be talked about without mentioning the term “football”.

If a keyword search list was going to be used to identify football, then the keyword list would have to be quite comprehensive, and include the following terms.

  • Football
  • Soccer
  • Match
  • Score
  • Game
  • Footy
  • Five a side
  • 5 a side
  • Players
  • League
  • Cup

This list is not guaranteed to bring back every email and document about football, but it is guaranteed to bring back a lot of non-relevant material. It is immediately obvious that terms such as “cup” and “match” could have lots of meanings other than football. If the document set is a couple of hundred thousand documents, or even a couple of million, then the number of false positives produced will be huge. In this case the keyword search list would not be very effective.
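The false positive problem can be made concrete with a toy example. The sketch below is illustrative only: the seven documents and their hand-applied relevance labels are invented, and only the single-word terms from the list above are used (phrases such as “five a side” would need phrase matching, which is left out here).

```python
import re

KEYWORDS = ["football", "soccer", "match", "score", "game", "footy",
            "players", "league", "cup"]

# Hypothetical documents, each labelled by hand as football-related or not.
documents = [
    ("Do you want to watch football tonight?", True),
    ("Are you going to see the match tonight?", True),
    ("Fancy a kickabout at lunch?", True),
    ("Can you match the invoice to the purchase order?", False),
    ("There is a cup of coffee on your desk.", False),
    ("The credit score came back this morning.", False),
    ("Minutes of the board meeting are attached.", False),
]

def is_hit(text, keywords):
    """True if any keyword appears as a whole word in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return any(k in words for k in keywords)

hits = [(text, relevant) for text, relevant in documents if is_hit(text, KEYWORDS)]
true_positives = sum(1 for _, relevant in hits if relevant)
false_positives = len(hits) - true_positives
missed = sum(1 for text, relevant in documents
             if relevant and not is_hit(text, KEYWORDS))

# Precision: share of hits that are relevant. Recall: share of relevant documents found.
precision = true_positives / len(hits)
recall = true_positives / (true_positives + missed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Even in this tiny set, more than half of the hits are about invoices, coffee and credit scores, while the “kickabout” email is missed entirely.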

Concept searching  could cluster the documents of a similar nature together and therefore help to find the football related emails.

In this trivial example we can see the failings of keyword searching and the benefits of concept searching, but that does not mean that concept searching has replaced keyword searching; far from it.

The use of keyword searching in a concept based world

Even if concept searching is used, reviewers will still need to find documents within clusters.  Going back to the football example, once the documents have been clustered a keyword search for “football”, could help identify the relevant cluster.

Equally, it may be that there are so many documents to start with that an initial cull with a very broad keyword search could be justified before going into a concept based review. Taking a document set from 10 million to 1 million documents could be seen as an economical way to approach a review. The reverse is also true: a concept search could be conducted and then a keyword search applied, to help focus in on the core documents quickly.

The use of keywords will not only depend on the case, but the tools and the service being used.  Some companies will apply keywords, and then only those documents responsive to the keyword search will ever be available for review. Other companies/services will apply the keyword search, after the data was loaded into a review platform. These two different service offerings can make a huge difference to the legal strategy, the results, and the costs.

The next article on keywords will discuss how and why.

Electronic Discovery: What is Concept Searching?

There is much discussion about concept searching, its benefits, etc (and there are  quite a few articles on this site about the very subject). But what is it, and how is it deployed?

What is concept searching?

Concept searching is a method of searching files not based on keywords, but on the subject matter of the document, paragraph, or sentence.  This is different to keyword searching which requires an exact keyword hit.

Example 1:

“Do you want to watch football tonight, I bet Chelsea scores?”

Example 2:

“Do you want to watch the game tonight, I bet Chelsea scores?”

If the term “football” was used as a keyword, then Example 1 would be found, but Example 2 would not.  But they are both clearly about football. However, a concept searching tool would be expected to “find” both examples, because “Concept Searching” is, as the name implies, actually looking for “concepts” rather than just keywords.

How is Concept Searching Deployed?

Concept Searching can be deployed in a variety of methods, and with many different names, depending on the vendor/service provider/consultant. We will attempt to cover the most common names, and methodologies, below.


Clustering is, perhaps, the most commonly known use of concept searching. It takes a set of data and breaks it into groups of “similar documents”. E.g. one group could be documents relating to football, another documents about meetings, etc. The number and size of the groups will depend on the documents and the concept searching tool being used. For example, 10,000 documents could be broken into 10 groups of 1,000 documents each, or 1,000 groups of 10 documents each. Equally there could be 1 group of 5,000 and 5 groups of 1,000. There are a near infinite number of combinations. The more advanced tools on the market allow the operator a degree of control over the sizing and nature of the groups.

The advantage of these groupings is that they allow resources to be focused effectively. E.g. the groupings that appear to be about football, parties, and other junk material can either not be reviewed, or just be scanned quickly. If there are 500 documents in one group, a sample review of that group shows that they all relate to fantasy football, and all the email titles appear to relate to fantasy football, then it may (case dependent) be reasonable to skip the rest of the group. Equally, if there is a group of 1,000 documents related to “contracts”, the senior reviewers can be dedicated to that cluster and a detailed review can be conducted early on in the case, rather than the documents going through one or two reviewers before they get reviewed in detail.
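The grouping idea can be sketched as follows. This is a crude stand-in, not how any commercial tool works: it groups documents purely by shared words (Jaccard overlap), whereas real concept searching tools use more sophisticated techniques. The stop word list, the `cluster` function and the `threshold` parameter are assumptions made for the illustration.

```python
import re

STOPWORDS = {"the", "a", "to", "of", "for", "is", "on", "in", "and", "at", "this", "my"}

def tokens(text):
    """Lower-cased word set with a few common stop words removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def jaccard(a, b):
    """Overlap between two token sets: 0.0 (nothing shared) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(documents, threshold=0.2):
    """Greedy single pass: join the first group whose seed document is similar enough."""
    groups = []  # each group is a list of document indices; index 0 is the seed
    for i, text in enumerate(documents):
        for group in groups:
            if jaccard(tokens(text), tokens(documents[group[0]])) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # nothing similar enough, so start a new group
    return groups

documents = [
    "Fantasy football league table this week",
    "Draft contract for the supplier agreement",
    "Fantasy football transfers before the league deadline",
    "Revised supplier contract with agreement terms",
]

print(cluster(documents))                 # looser threshold, fewer and larger groups
print(cluster(documents, threshold=0.5))  # stricter threshold, more and smaller groups
```

The `threshold` argument plays the role of the operator control mentioned above: raising it produces more, smaller groups, and lowering it produces fewer, larger ones.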

Auto-Tagging/Predictive Marking

This methodology works on the same technology, identification of documents through concepts, but rather than creating multiple groups it will create a couple of groups, or possibly only one. Generally a small sample of documents that are similar to each other is provided, to search against a very large number of unknown documents. The concept searching tool will search the large number of unknown documents for documents which are “similar” to the known document set. Documents that are found to be similar will then be identified and clustered together. This type of technology can be deployed in several different ways, e.g.

  • Looking for documents which are similar to a document already disclosed
  • Looking for documents that are similar to documents that are known to be junk, e.g. if there is a lot of social networking email traffic, this could be used to identify much of the spam and remove it from the review set
  • If  “hot documents” are found during the initial review, these can be used to identify other similar documents.
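A minimal sketch of the “find more like these” idea is shown below. It ranks unknown documents by cosine similarity of simple word counts against a handful of known documents. This is an assumption-laden illustration: word counting is a much blunter measure than the concept analysis a real tool performs, and the function names and sample emails are hypothetical.

```python
import math
import re
from collections import Counter

def vector(text):
    """Represent a document as a bag of lower-cased word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_by_similarity(seeds, unknown):
    """Score each unknown document by its best match against any seed document."""
    seed_vectors = [vector(s) for s in seeds]
    scored = [(max(cosine(vector(doc), s) for s in seed_vectors), doc)
              for doc in unknown]
    return sorted(scored, reverse=True)

# Known junk: a social networking notification already identified by a reviewer.
seeds = ["Your friend request on the social network is waiting"]

unknown = [
    "You have a new friend request waiting on the social network",
    "Board minutes for the March meeting",
]

for score, doc in rank_by_similarity(seeds, unknown):
    print(f"{score:.2f}  {doc}")
```

Documents scoring above some chosen cut-off could then be tagged in bulk, whether the seeds are junk, hot documents, or documents already disclosed.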

Concept  Searching “words”

Some concept searching tools allow the searching of words or paragraphs. In these circumstances the tool is doing just what is done above, but on a more focused paragraph or sentence rather than an entire document. This is particularly important when dealing with large documents; if there is a 500 page document that is  important or “hot”, because of one paragraph, concept searching for similar documents will often produce junk. In these cases concept searching for a key paragraph may be more effective.
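The effect of a long document diluting one key paragraph can be illustrated with a toy comparison. Everything here is hypothetical (the key paragraph, the synthetic long document and the word-overlap measure), and the dilution would be far stronger in a genuine 500 page document with thousands of distinct words.

```python
import re

def tokens(text):
    """Lower-cased set of words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Overlap between two token sets: 0.0 (nothing shared) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

key_paragraph = "payment to the agent must stay off the books"

# Hypothetical long document: many routine paragraphs plus one of interest.
long_document = "\n\n".join(
    [f"Routine clause number {n} covering delivery schedules and invoicing"
     for n in range(50)]
    + ["The payment to our agent should stay off the books for now"]
)

# Compare the key paragraph against the document as a whole...
whole_score = jaccard(tokens(key_paragraph), tokens(long_document))

# ...and against each paragraph of the document separately.
paragraph_scores = [jaccard(tokens(key_paragraph), tokens(p))
                    for p in long_document.split("\n\n")]
best_paragraph_score = max(paragraph_scores)

print(f"whole document: {whole_score:.2f}")
print(f"best paragraph: {best_paragraph_score:.2f}")
```

The paragraph-level comparison scores the one interesting paragraph more highly than the whole-document comparison does, and it also points to where in the document that paragraph sits.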

Keyword Searching, is this still needed?  With the advent of concept searching, should keyword searching still be conducted? This subject will be covered in the next article on concept searching.

Forensics Quizzes are on the move

The following quizzes have now been moved to the new site:

A selection of computer forensics tests and quizzes is available on the parent site, Where Is Your Data?. These include:

Computer Forensics – NTFS (1) (theory)

Computer Forensics – NTFS (2) (theory)

Computer Forensics  – Law (theory)

Who’s Who (in computer forensics and electronic discovery)

The following quizzes have not all been moved across to the new site yet, but are being moved across (all in good time)

Electronic Discovery: Review Platform – eView

The latest version of Integreon’s eView (Version 3) is discussed below.

The review platform has a clean look to it, unlike other platforms, e.g. RingTail, which has very granular detail but can also appear cluttered. In fact eView has an almost “Outlook 2007” look to it, which is almost certainly deliberate, as it will mean that lawyers looking for emails will feel comfortable in that environment.

It has all the standard features that are expected in a review platform: searching, de-duplication, etc. But it also has message threading, which is increasingly common but still not standard in all review platforms.

Like most review platforms it has an SQL back end, but the system is accessed via Citrix, which is less common in the review platform community. Offerings from RingTail, Relativity, Recommind, iConnect, Documatrix, etc., are all available via a web browser (normally only Internet Explorer). The major exception to this is Attenex, which also works via Citrix.

eView looks like a standard linear review platform, but has Content Analyst, a concept searching tool built in. This means that both a linear and non-linear review can be conducted in the same review platform. This puts eView competing directly with Relativity, which has been very well received on both sides of the Atlantic.

The screen shots below show eView being used to concept search. In this first example a paragraph is being highlighted, so that documents that are similar to that paragraph can be searched for.

Integreon eView Screen Shot Concept Searching

Integreon eView Screen Shot Concept Searching (2)

Company Profile: Integreon


Integreon is a US based electronic discovery firm with offices around the US, which also offers services internationally, including outsourced review via contract paralegals/attorneys in several locations including India and the Philippines. Recently Integreon also opened a data center/review center in the UK, following a deal with the UK law firm Osborne Clarke.

Integreon has grown both organically and via acquisition. The company acquired Datum Legal in the summer of 2008, and then went on to acquire ONSITE3 in April 2009. At the time ONSITE3 was in Chapter 11 bankruptcy, so this would have been an excellent opportunity for Integreon.

Integreon’s now vastly improved review platform, eView, was based on the ONSITE3 offering (it had previously been known as OnSite Discovery).

Electronic Discovery: Francis Bacon and Concept Searching

Human errors and the human state of mind have a big effect on the decision making process in electronic discovery, but what are these errors?

A few hundred years ago Francis Bacon stated: The human understanding when it has once adopted an opinion (either as being the received opinion or as being agreeable to itself) draws all things else to support and agree with it. And though there be a greater number and weight of instances to be found on the other side, yet these it either neglects and despises, or else by some distinction sets aside and rejects; in order that by this great and pernicious predetermination the authority of its former conclusions may remain inviolate

What Francis Bacon was saying, rather elegantly, is that people stick to their guns. What he believed to be true in the 17th century, psychologists in the 20th and 21st centuries have now shown to be true.

Humans Examining Evidence

It has been demonstrated by psychologists that people often not only stick to their beliefs, but also seek out evidence to reinforce their own opinions and reject new evidence which is contrary to their beliefs. Test after test has shown this, and it can be seen in real life situations, from politicians to generals. In the run up to Pearl Harbor there were numerous warnings that an attack was about to occur, including a Japanese submarine that was sunk just outside the harbor only 1 hour before the attack. But the admiral in charge, Admiral Kimmel, believed that Japan would not attack Pearl Harbor, and so ignored the information and deliberately misinterpreted information, intelligence, and warnings – he stuck to his guns[1]. He did not even cancel weekend leave for his staff, or ask if the Army were manning the anti-aircraft guns (this would have only required a single phone call). This is not uncommon, and people at all levels do this quite regularly. This state of mind affects scientists and politicians alike.

Humans making decisions

Not only do humans often stick to incorrect decisions, but we are easily influenced. For example people will often follow something known as the “rule of primacy”. This says, in short, that the first thing people learn about a subject, they take to be true.

Example: If a person is told that Car A is fantastic by a friend of theirs, then they will tend to believe that, even if they are later told that Car A is less than good. In fact they will seek out evidence to support what they already believe.

Another well known cause of errors in humans is the availability error. This means that the stronger and more vivid a memory is, the more likely people are to make decisions based on that information. This has been shown in labs and the real world. For example, the take-up of earthquake insurance in areas that have earthquakes increases immediately after a quake but decreases the longer it has been since an earthquake occurred, because the memory of the quake fades. However, the probability of a quake increases the longer the time between quakes, and the area is safest just after a quake. I.e. people are buying and not buying insurance at exactly the wrong times. Equally, if people are asked to estimate which is more common, words beginning[2] with the letter “R” or words having “r” as the third letter, they will often say the former, as they can immediately think of words beginning with R: rain, rainbow, rivet, red, real, reality, etc. But in fact there are more words with “r” as the third letter: street, care, caring, borrow, etc. However, people have words with the first letter “R” strongest in their minds, so that is what they believe.

Other well known causes of human errors include:

Peer pressure/conformity: People tend to follow the decisions of others, even when they are quite obviously wrong. There are well known tests of this, such as a person being put in a room with 5 or 10 other people and asked to complete simple tasks, such as saying which is the shortest of three lines, or how many beats of a drum there were. The tasks are simple; the beats are easy to count and one of the lines is obviously shorter. However, the other 5 or 10 people in the room are actors, paid to deliberately give the same wrong answer. If everybody's answers were read out aloud, the test subject would, more often than not, follow the incorrect answers.

Obedience/management pressure: Following the opinion of a superior (depending on the culture), regardless of whether it is right, is something that can often occur. This was most famously demonstrated in the Stanley Milgram experiments, where volunteers willingly administered what they believed to be dangerous, potentially lethal, electric shocks to other innocent people, simply because they were asked to.

There are many more examples proving these human conditions, and even more conditions that cause us to make errors on a day to day basis – it’s just the nature of the human brain. It is how we work (or don’t).

Electronic Discovery & Psychology

So what has all of this psychology and “soft science” got to do with electronic discovery?

Electronic discovery is historically driven by humans, from keyword selection to the relevance of a document.  It is the errors identified above, and more, that can come into play during a review.

Below are examples of how these well known errors can affect a review:

  • Once a person decides on keyword search criteria, once they have put their flag in the ground, they are, statistically, unlikely to be willing to change their mind about the value of the search criteria. They may even ignore evidence or documents that could prove otherwise. In fact research has shown that once a person states a decision publicly, or commits it to writing, they are even more likely to stick to their guns than somebody who makes that decision privately.
  • If a person has to review another 500 page document, and it’s late and they want to go home, then they may quickly start to believe that this document is not relevant. They may start to scan the document looking for information that demonstrates that the document is not relevant, rather than looking for evidence that shows it is relevant.
  • A second opinion on a document’s relevance may be sought from a senior manager or colleague, and that opinion will then be followed through the review, regardless of whether it was right or not, even if the original reviewer believes the opinion to be wrong.
  • If a review platform does not record who decided whether a document is relevant or not, then the reviewer may be less inclined to be diligent, as they are removed from their responsibility by anonymity. [Anonymity has also been shown to be an influencing factor in people’s behavior and choices].


There are methods to try to reduce the number of traps the human brain walks into. Simply being aware of the problems and making an effort to avoid them is one solution, e.g. consciously giving as much weight to the last piece of evidence seen as to the first.

However, we are all fallible, and in large scale reviews avoiding these errors is going to be very difficult, and possibly expensive in terms of time.

The most obvious solution is automation through concept searching. As previously discussed concept searching can be of great value during a large document review and, like all systems, it will have errors; but we know it’s not susceptible to the human failings discussed in this article.

It doesn’t matter what the system saw first, how strong or visual a document is, or what other reviewers think. Concept searching will only apply repeatable, known logic to a group of documents.

[1] As a result of this, Admiral Kimmel was later demoted

[2] This example is lifted directly from the book Irrationality, by Stuart Sutherland

