Electronic Discovery: Concept Searching & Keyword Searching

Keyword Searching & Concept Searching

In the wake of the DigiCel case, and with the constant strides being made in concept searching (not to mention the hefty marketing budgets spent on advertising the tools), is keyword searching now a thing of the past?

The preceding article, which discusses concept searching, used an example of where concept searching would trump keyword searching; that same example is discussed below.

Example 1:

“Do you want to watch football tonight, I bet Chelsea scores?”

Example 2:

“Do you want to watch the game tonight, I bet Chelsea scores?”

If the term “football” was used as a keyword, then Example 1 would be found, but Example 2 would not. Yet both are clearly about football. A concept searching tool would be expected to “find” both examples, because “Concept Searching” is, as the name implies, actually looking for “concepts” rather than just keywords.

If a keyword search was to be conducted, the keyword search list could be expanded to include “game”, but then the following sentence would not be found:

  • Are you going to see the match tonight?

Therefore the word “match” could be added. But what about the sentence:

  • Fancy a game of footy tonight?

There are, of course, numerous combinations in which a football match can be talked about without mentioning the term “football”.

If a keyword search list was going to be used to identify football, then the list would have to be quite comprehensive, and include terms such as the following:

  • Football
  • Soccer
  • Match
  • Score
  • Game
  • Footy
  • Five a side
  • 5 a side
  • Players
  • League
  • Cup

This list is not guaranteed to bring back every email and document about football, but it is guaranteed to bring back a lot of non-relevant material. It is immediately obvious that the terms “cup” and “match” have many meanings other than football. If the document set is a couple of hundred thousand documents, or even a couple of million, then the number of false positives produced will be huge. In this case the keyword search list would not be very effective.
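
To make the false-positive problem concrete, here is a minimal sketch in Python of the keyword list above run over a handful of invented messages. The messages and the naive substring matching are purely illustrative; this is not how any particular review platform works.

    # Hypothetical keyword search: the list from above over invented messages.
    KEYWORDS = {"football", "soccer", "match", "score", "game", "footy",
                "five a side", "5 a side", "players", "league", "cup"}

    messages = [
        "Fancy a game of footy tonight?",             # relevant
        "Does the font colour match the logo?",       # false positive: "match"
        "Bring a cup of coffee to the 3pm meeting.",  # false positive: "cup"
        "Payroll report attached.",                   # correctly ignored
    ]

    def keyword_hits(text):
        """Return the keywords found in the text (naive substring test)."""
        lowered = text.lower()
        return {kw for kw in KEYWORDS if kw in lowered}

    for msg in messages:
        hits = keyword_hits(msg)
        print(f"{sorted(hits) if hits else 'no hit'} <- {msg}")

Two of the four invented messages are false positives, which mirrors the point above: on a real corpus of millions of documents, the noise can dwarf the signal.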

Concept searching could cluster documents of a similar nature together and therefore help to find the football-related emails.

In this trivial example we can see the failings of keyword searching and the benefits of concept searching, but that does not mean that concept searching has replaced keyword searching; far from it.

The use of keyword searching in a concept-based world

Even if concept searching is used, reviewers will still need to find documents within clusters. Going back to the football example, once the documents have been clustered, a keyword search for “football” could help identify the relevant cluster.

Equally, it may be that there are so many documents to start with that an initial cull using a very broad keyword search could be justified before going into a concept-based review. Taking a document set from 10 million down to 1 million documents could be seen as an economical way to approach a review. The reverse is also true: a concept search could be conducted first and a keyword search applied afterwards, to help focus in on the core documents quickly.
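
A hedged sketch of that two-stage approach, in Python: a very broad keyword cull first, with the concept review only indicated by a comment. The broad terms and sample documents are invented.

    # Stage 1: broad keyword cull; only survivors go on to the concept review.
    BROAD_TERMS = {"contract", "agreement", "invoice", "payment"}  # invented

    all_documents = [
        "Draft agreement attached for your review.",
        "Who scored in the game last night?",
        "Invoice 104 is now overdue.",
    ]  # invented examples

    survivors = [doc for doc in all_documents
                 if any(term in doc.lower() for term in BROAD_TERMS)]

    print(len(survivors), "of", len(all_documents), "documents kept")
    # Stage 2 (not shown): load the survivors into the concept searching tool.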

The use of keywords will depend not only on the case, but also on the tools and the service being used. Some companies will apply keywords, and then only those documents responsive to the keyword search will ever be available for review. Other companies/services will apply the keyword search after the data has been loaded into a review platform. These two different service offerings can make a huge difference to the legal strategy, the results, and the costs.

The next article on keywords will discuss how and why.

Electronic Discovery: What is Concept Searching?

There is much discussion about concept searching, its benefits, etc. (and there are quite a few articles on this site about this very subject). But what is it, and how is it deployed?

What is concept searching?

Concept searching is a method of searching files based not on keywords, but on the subject matter of the document, paragraph, or sentence. This is different to keyword searching, which requires an exact keyword hit.

Example 1:

“Do you want to watch football tonight, I bet Chelsea scores?”

Example 2:

“Do you want to watch the game tonight, I bet Chelsea scores?”

If the term “football” was used as a keyword, then Example 1 would be found, but Example 2 would not.  But they are both clearly about football. However, a concept searching tool would be expected to “find” both examples, because “Concept Searching” is, as the name implies, actually looking for “concepts” rather than just keywords.

How is Concept Searching Deployed?

Concept searching can be deployed in a variety of ways, and under many different names, depending on the vendor, service provider, or consultant. We will attempt to cover the most common names and methodologies below.

Clustering/Buckets

This is, perhaps, the most commonly known use of concept searching. It takes a group of data and then breaks it into groups of “similar documents”. E.g. one group could contain documents relating to football, another documents about meetings, etc. The number and size of the groups would depend on the documents and the concept searching tool being used. For example, 10,000 documents could be broken into 10 groups of 1,000 documents each, or 1,000 groups of 10 documents each. Equally there could be 1 group of 5,000 and 5 groups of 1,000. There is a near-infinite number of combinations. The more advanced tools on the market allow the operator a degree of manipulation over the sizing and nature of the groups.
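
As a minimal clustering sketch, here is the bucketing idea in Python using scikit-learn; the library and the four invented documents are assumptions for illustration, since no specific engine is named here.

    # Break invented documents into two "buckets" with TF-IDF and k-means.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "Do you want to watch football tonight, I bet Chelsea scores?",
        "Fancy a game of footy tonight?",
        "The board meeting is moved to Thursday at 3pm.",
        "Agenda attached for Thursday's board meeting.",
    ]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    for label, doc in zip(labels, docs):
        print(f"bucket {label}: {doc}")

Here the number of buckets is fixed at two; as noted above, the more advanced tools let the operator adjust the number and size of the groups.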

The advantage of these groupings is that they allow the effective focusing of resources. E.g. the groupings that appear to be about football, parties, and other junk material can either be left unreviewed or just scanned quickly. If there are 500 documents in one group, a sample review of that group shows that they all relate to fantasy football, and all the email titles appear to relate to fantasy football, then it may (case dependent) be reasonable to skip the rest of the group. Equally, if there is a group of 1,000 documents related to “contracts”, the senior reviewers can be dedicated to that cluster and a detailed review can be conducted early on in the case, rather than those documents going through one or two reviewers before they get reviewed in detail.

Auto-Tagging/Predictive Marking

This methodology works on the same technology, the identification of documents through concepts, but rather than creating multiple groups it will create a couple of groups, or possibly only one. Generally a small sample of documents, similar to each other, is provided to search against a very large number of unknown documents. The concept searching tool will search the unknown documents for those which are “similar” to the known document set. Documents that are found to be similar will then be identified and clustered together. This type of technology can be deployed in several different ways (a sketch of the underlying similarity search follows the list below), e.g.:

  • Looking for documents which are similar to a document already disclosed
  • Looking for documents that are similar to documents that are known to be junk, e.g. if there is a lot of social networking email traffic, this could be used to identify much of the spam and remove it from the review set
  • If “hot documents” are found during the initial review, these can be used to identify other similar documents.
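
A hedged sketch of this “find more like these” idea, again in Python with scikit-learn as an assumed stand-in for whatever engine is in use: every unknown document is scored by cosine similarity against a small, invented seed set.

    # Score unknown documents against known "seed" documents (all invented).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    seeds = ["Join my fantasy football league before the season starts."]
    unknown = [
        "Last call to pick your fantasy football team for this season!",
        "Please sign the attached contract by Friday.",
    ]

    vectorizer = TfidfVectorizer().fit(seeds + unknown)
    scores = cosine_similarity(vectorizer.transform(unknown),
                               vectorizer.transform(seeds))

    for doc, score in zip(unknown, scores.max(axis=1)):
        print(f"{score:.2f}  {doc}")  # higher = more similar to the seeds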

Concept Searching “Words”

Some concept searching tools allow the searching of words or paragraphs. In these circumstances the tool is doing just what is described above, but on a more focused paragraph or sentence rather than an entire document. This is particularly important when dealing with large documents; if a 500-page document is important or “hot” because of one paragraph, concept searching for similar whole documents will often produce junk. In these cases concept searching for a key paragraph may be more effective.
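
The paragraph-level variant only changes the unit of comparison. A brief sketch (the same assumed Python/scikit-learn stand-in, with invented text): the long document is split into paragraphs, and each paragraph, rather than the whole document, is scored against the key paragraph.

    # Compare paragraphs, not whole documents (all text invented).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    long_document = (
        "Minutes of the quarterly planning meeting.\n\n"
        "The supplier agreed a confidential rebate of 5% on all orders.\n\n"
        "Lunch will be provided at the next offsite."
    )
    key_paragraph = ["A confidential rebate was agreed with the supplier."]

    paragraphs = long_document.split("\n\n")
    vec = TfidfVectorizer().fit(paragraphs + key_paragraph)
    scores = cosine_similarity(vec.transform(paragraphs),
                               vec.transform(key_paragraph))

    print(paragraphs[scores.argmax()])  # the rebate paragraph surfaces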

Keyword Searching: is this still needed?

With the advent of concept searching, should keyword searching still be conducted? This subject will be covered in the next article on concept searching.

Electronic Discovery: Francis Bacon and Concept Searching

Human errors and the human state of mind have a big effect on the decision making process in electronic discovery, but what are these errors?

A few hundred years ago Francis Bacon stated: “The human understanding when it has once adopted an opinion (either as being the received opinion or as being agreeable to itself) draws all things else to support and agree with it. And though there be a greater number and weight of instances to be found on the other side, yet these it either neglects and despises, or else by some distinction sets aside and rejects; in order that by this great and pernicious predetermination the authority of its former conclusions may remain inviolate.”

What Francis Bacon was saying, rather elegantly, is that people stick to their guns. What he believed to be true in the 17th century, psychologists in the 20th and 21st centuries have now shown to be true.

Humans Examining Evidence

It has been demonstrated by psychologists that people often not only stick to their beliefs but seek out evidence to reinforce their own opinions and reject new evidence which is contrary to their beliefs. Test after test has shown this, and it can be seen in real-life situations, from politicians to generals. In the run-up to Pearl Harbor there were numerous warnings that an attack was about to occur, including a Japanese submarine that was sunk just outside the harbor only an hour before the attack. But the admiral in charge, Admiral Kimmel, believed that Japan would not attack Pearl Harbor, and so he ignored and deliberately misinterpreted information, intelligence, and warnings: he stuck to his guns[1]. He did not even cancel weekend leave for his staff, or ask if the Army were manning the anti-aircraft guns (which would have required only a single phone call). This is not uncommon, and people of all levels do it quite regularly. This state of mind affects scientists and politicians alike.

Humans making decisions

Not only do humans often stick to incorrect decisions, but we are easily influenced. For example, people will often follow something known as the “rule of primacy”. This says, in short, that people take the first thing they learn about a subject to be true.

Example: If a person is told by a friend that Car A is fantastic, then they will tend to believe it, even if they are later told that Car A is less than good. In fact they will seek out evidence to support what they already believe.

Another well-known cause of errors in humans is the availability error. This means that the stronger and more vivid the memory, the more likely people are to make decisions based on that information. This has been shown in labs and in the real world. For example, the take-up of earthquake insurance in areas that have earthquakes increases immediately after a quake but decreases the longer it has been since one occurred, because the memory of the quake fades. However, the probability of a quake increases with the time since the last one, and is lowest just after a quake; i.e. people are buying and not buying insurance at exactly the wrong times. Equally, if people are asked to estimate which is more common, words beginning[2] with the letter “r” or words having “r” as the third letter, they will often say the former, as they can immediately think of words beginning with “r”: rain, rainbow, rivet, red, real, reality, etc. But in fact there are more words with “r” as the third letter: street, care, caring, borrow, etc. People have words beginning with “r” strongest in their minds, so that is what they believe.

Other well-known causes of human error include:

Peer pressure/conformity: people tend to follow the decisions of others, even when those decisions are quite obviously wrong. There are well-known tests of this, such as a person being put in a room with 5 or 10 other people and asked to complete simple tasks, such as saying which is the shortest of three lines or how many drum beats there were. The tasks are simple; the beats would be easy to count, or one of the lines would be obviously shorter, but the test subject would be in a room with 5 or 10 other people who were actors, paid to deliberately give the same wrong answer. If everybody’s answers were read out aloud, the test subject would, more often than not, follow the incorrect answers.

Obedience/Management pressure: following the opinion of a superior (depending on the culture), regardless of whether it is right, is something that often occurs. This was most famously demonstrated in the Stanley Milgram tests, where volunteers willingly applied what they believed were lethal voltages to innocent people, simply because they were asked to.

There are many more examples proving these human conditions, and even more conditions that cause us to make errors on a day-to-day basis – it’s just the nature of the human brain. It is how we work (or don’t).

Electronic Discovery & Psychology

So what has all of this psychology and “soft science” got to do with electronic discovery?

Electronic discovery is historically driven by humans, from keyword selection to the relevance of a document.  It is the errors identified above, and more, that can come into play during a review.

Below are examples of how these well known errors can affect a review:

  • Once a person decides on keyword search criteria, once they have put their flag in the ground, they are, statistically, unlikely to be willing to change their mind about the value of the search criteria. They may even ignore evidence or documents that could prove otherwise. In fact research has shown that once a person states a decision publicly, or commits it to writing, they are even more likely to stick to their guns than somebody who makes that decision privately.
  • If a person has to review another 500-page document, and it’s late and they want to go home, they may quickly start to believe that the document is not relevant. They may start to scan the document looking for information that demonstrates that it is not relevant, rather than looking for evidence that shows it is.
  • A second opinion on a document’s relevance may be sought, from a senior manager or colleague, and that opinion will then be followed throughout the review, regardless of whether it was right or not, even if the original reviewer believes the opinion to be wrong.
  • If a review platform does not record who decided whether a document is relevant or not, then the reviewer may be less inclined to be diligent, as they are removed from their responsibility by anonymity. [Anonymity has also been shown to be an influencing factor in people’s behavior and choices.]

Solutions?

There are methods to try and reduce the number of traps the human brain walks into. Simply being aware of the problems and making an effort to avoid them is one solution, e.g. consciously giving as much weight to the first piece of evidence seen as to the last.

However, we are all fallible, and in large-scale reviews avoiding these errors is going to be very difficult, and possibly expensive in terms of time.

The most obvious solution is automation through concept searching. As previously discussed, concept searching can be of great value during a large document review and, like all systems, it will have errors; but we know it is not susceptible to the human failings discussed in this article.

It doesn’t matter what the system saw first, how strong or vivid a document is, or what other reviewers think. Concept searching will only apply repeatable, known logic to a group of documents.


[1] As a result of this, Admiral Kimmel was later demoted

[2] This example is lifted directly from the book Irrationality, by Stuart Sutherland

Electronic Discovery: Concept Searching and Languages

Does concept searching handle foreign languages?

It’s an important question for those operating in an international environment. Most tools in the electronic discovery industry are driven by the US market and so, understandably, are US-centric. For concept searching tools this means that they tend to strongly favour the English language.

But what if you’re dealing with a case in Switzerland or Mexico? Will concept searching really work for documents in Spanish?

It will depend on the tool and the operator.

Concept searching tools depend on how words relate to each other rather than on specific keywords.

Example: How the term “house” relates to “buy”, “pricing”, “client”, or “sale” within a single document could well be more important than the appearance of “estate agent” or “realtor” in identifying documents relating to house sales.

Relating documents in this way is hard enough, but doing this in multiple languages is even harder.

Some tools, such as Attenex or Recommind, are “pre-programmed” with languages. They “understand” what the terms “house” and “sale” mean. So if both words appear in the same document, with a certain frequency, the document will be put with other similar documents. “Similar” documents could be files that include the words “apartment” and “sale”, or “housing sales”.

This is impressive enough to do in any one language; to do it with multiple languages is even harder. For tools like Attenex this is achieved through brute force, adding more and more languages to the application.

However, tools such as ContentAnalyst take an entirely different approach. They are not programmed with a language; instead they start each project completely blank. They need to be taught a language from a data set, and this is done on the fly by the users of ContentAnalyst.

This means that as the operator starts loading in data, the tool begins to understand the particular data set. Once the loading has been completed, ContentAnalyst will have built up a picture of the data, and concept searching can then be conducted. This offers a huge advantage over other tools, as it is completely language independent.

The documents could be in English, Swahili, or Elvish; it would make no difference to the tool. The tool looks at how words relate to each other, rather than at the actual words themselves.
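
As a hedged illustration of this learn-from-the-data approach, here is a minimal latent semantic analysis sketch in Python with scikit-learn. It is a generic stand-in for the idea, not ContentAnalyst’s actual implementation, and the corpus is invented.

    # Learn "concepts" purely from co-occurrence, with no dictionary or
    # grammar, so the pipeline is identical whatever language the corpus uses.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "house for sale, pricing negotiable",
        "apartment sale agreed with the client",
        "quarterly sales figures for the board",
        "venta de casa, precio negociable",  # Spanish: house sale
    ]  # invented examples

    vectors = TfidfVectorizer().fit_transform(corpus)
    concepts = TruncatedSVD(n_components=2, random_state=0).fit_transform(vectors)

    print(cosine_similarity(concepts).round(2))  # similarity in concept space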

This technology is not only language independent, but industry independent. For example, the term “spam” in an IT company means something very different from the same term in a meat company. As ContentAnalyst learns the language for each project, it will pick up on differences in the use of language, as well as the languages themselves.
