Concept Searching: Better or Worse than a Human?

Fully answering the question “Is a computer better than a human at sorting documents for electronic discovery?” would require a significant amount of research and is beyond the scope of this article. But if the question is rephrased slightly to “Is concept searching a better choice than a human for an initial cull of documents?”, it can be answered more easily, especially when additional restrictions are put around the question, namely the size of the data sets.

Concept searching only really comes into its own with large data sets. For small data sets, it’s simply not worthwhile, financially or technically, to consider this technology.

Scale of Data

If there are just 5,000 documents to review, the entire review should be conducted by humans. There would be no need for concept searching, and possibly not even keyword searching. A couple of lawyers could plough through that amount of data in a little over a week.

But if, instead of five thousand documents, there are 5 million documents, the picture changes drastically. Assuming the billing rate for a paralegal or junior lawyer is $1,000 a day and they can review 500 documents a day, the initial review would take 10,000 reviewer-days and cost $10 million. [These figures are used to gauge an order of magnitude rather than specific values.]
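The arithmetic behind that figure can be checked with a few lines of Python. This is only a rough sketch using the order-of-magnitude numbers above; the function name `manual_review_cost` and its parameters are made up for illustration, and it ignores anything beyond a flat day rate and a flat review speed.

```python
import math


def manual_review_cost(documents, docs_per_reviewer_day, day_rate):
    """Return (reviewer_days, total_cost) for a purely manual first-pass review.

    Assumes a flat day rate and a constant review speed, as in the article.
    """
    # Round up: a part-used day is still billed as a full day.
    reviewer_days = math.ceil(documents / docs_per_reviewer_day)
    return reviewer_days, reviewer_days * day_rate


# The large data set from the text: 5 million documents.
large_days, large_cost = manual_review_cost(5_000_000, 500, 1_000)
print(f"Large set: {large_days:,} reviewer-days, ${large_cost:,}")

# The small data set from the text: 5,000 documents.
small_days, small_cost = manual_review_cost(5_000, 500, 1_000)
print(f"Small set: {small_days:,} reviewer-days, ${small_cost:,}")
```

On the same assumptions the 5,000 document set comes out at 10 reviewer-days, which is why nobody argues about reviewing it by hand.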

Traditionally, data sets that started as millions of documents would be culled down, by keyword searching and filtering, to a more manageable size. Increasingly, though, these data sets are being culled down to millions of files, not from millions.

Too Great a Cost

These huge costs put the manual review of a massive data set financially out of reach for many projects.

For this reason a different approach needs to be taken, and this is where concept searching steps in, with the benefits of clustering and automatic categorization.

The Theory

The idea is that concept searching tools will reduce the data set from millions to a more manageable number, e.g. 500,000, or alternatively allow a review to be conducted more efficiently, letting a paralegal plough through thousands of documents a day rather than just a couple of hundred. Exactly how the concept searching tools are deployed will depend on the consultants involved in the case and the software available to them.

There is no doubt that the concept searching tools can, and do, perform the function of reducing data sets and increasing the speed of the review.

The question here is not how the tools work, or whether they reduce data, but whether they are a better choice than humans.

The Concern

An article published by FlashBack Data sums up the concern of many people in the industry, electronic discovery consultants and lawyers alike:

[F]ooling a computer is often easier than fooling a human.  A computer looking for an email regarding illegal stock trades may have a host of keywords: ‘drop’ ‘price’ ‘downturn’ ‘bottom’ ‘loss’ ‘sell’ and so forth,  but it won’t see what you do in this exchange:

“How’s it going today?”
“Everything is the Titanic today, jump ship.”

The message is clear as a bell, but [Concept Searching] software passes right over it, foiled by the lack of human experience that Joe Everyman has

This example is both interesting and convincing. It neatly captures the essence of the concern of many people, but it’s also inaccurate.

The Reality

Firstly, in the case discussed it would not be financially viable to review all documents manually. The email above would definitely not be found if nobody reviewed the data, no matter how much human experience the reviewers have!

Secondly, concept searching tools don’t just work on keywords; they work on concepts and, depending on the tool, may or may not be preprogrammed with language. For example, Attenex, which understands language, may place the email with other ship-related subject matter. This could be very useful, grouping together other emails of a similar nature. How many ship-related emails does a trading company have that aren’t of this nature?

Thirdly, some tools, such as Content Analyst, don’t actually have language programmed into them; they learn the language as required. This is very useful because different words mean different things to different groups of people. For example, spam means one thing to meat producers (and Monty Python fans), but another to those who work with email. Therefore learning the language of that particular company, division, or team can have hugely beneficial effects.

In the example provided by FlashBack, “ships”, and specifically the Titanic, are related to dumping stock. Other emails of a similar nature could well make this link. For example, an email may say, “This market is not good, probably going to have to sell stock today. I have that sinking feeling… like being on the Titanic”. This type of language would allow Content Analyst to build up an understanding of the language used and bring those files together.
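To make the idea concrete, here is a deliberately tiny Python sketch of how vocabulary learned from the document set itself can link the two Titanic emails. It is a toy co-occurrence expansion written for this article, not how Content Analyst, Attenex, or any other product actually works; the corpus, the stopword list, and the function names are all assumptions for illustration.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Toy corpus: the first two emails echo the examples above, the third is filler.
EMAILS = [
    "This market is not good, probably going to have to sell stock today. "
    "I have that sinking feeling, like being on the Titanic",
    "Everything is the Titanic today, jump ship",
    "Lunch menu for Friday: soup and sandwiches",
]

# Minimal hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "is", "a", "to", "i", "on", "that", "this", "not",
             "have", "for", "and", "of", "like", "being"}


def tokens(text):
    """Lower-case the text and return its distinct non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def cooccurrence(docs):
    """Count how often each pair of words appears in the same document."""
    counts = defaultdict(Counter)
    for doc in docs:
        for a, b in combinations(sorted(tokens(doc)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def keyword_hits(seed, docs):
    """Plain keyword search: documents containing the seed word itself."""
    return [i for i, doc in enumerate(docs) if seed in tokens(doc)]


def related_docs(seed, docs):
    """Documents containing the seed or any word seen alongside it in the corpus."""
    expanded = {seed} | set(cooccurrence(docs)[seed])
    return [i for i, doc in enumerate(docs) if tokens(doc) & expanded]


print(keyword_hits("sell", EMAILS))
print(related_docs("sell", EMAILS))
```

The keyword search for “sell” finds only the first email. The expanded search also brings back the “jump ship” email, because “Titanic” was seen alongside “sell” elsewhere in the set, while the lunch email is left out. Real tools are far more sophisticated, and a crude expansion like this would drag in a lot of noise on a real data set, but the principle is the same: code words used repeatedly become a pattern the tool can follow.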

Fourthly, documents and emails rarely exist by themselves; there will be replies, forwards, and other documents of a similar nature in the document set. All of these help both humans and concept searching tools to pick up a pattern of language and understand how language is being used. That is, the fact that people are using code words may help, not hinder, the concept searching.

Fifthly, humans make errors. Lots of errors. All the time. A person put in a room for 3 months, required to read similar documents, over and over again, on the same screen, with pressure to perform quickly, will make mistakes. Different people will make different judgments on what is or is not relevant, with different error levels. It’s not even a consistent error rate across the entire review. Concept searching tools, like any tool, will make errors, but they will be consistent, repeatable errors.

Summary

In the same way that de-duplication, keyword searching and electronic review have become non-optional over the past 10 years, concept searching will increasingly become a requirement, simply to deal with the data sizes. The issue of using concept searching tools or humans to conduct an initial cull will be moot, as costs will enforce the use of these tools.

There is of course a risk to using concept searching, just as there is a risk to humans reviewing and a risk to choosing keywords and using them. It is about assessing those risks.

The discussions around concept searching can be viewed like the discussions around safety equipment on aircraft. The more safety equipment used, the safer the flight: better seats, smoke hoods, wider aisles, and more emergency exits. But after a while the cost and weight of these safety features will prevent the aircraft from taking off. We still take planes, despite not having our own personal parachute and emergency exit. It’s a risk, but it’s one we live and work with.

Using concept searching tools is like flying transatlantic without a parachute: it’s a risk, but one we take. It’s a reasonable risk.
