Electronic Discovery: Concept Searching and Languages

Does concept searching handle foreign languages?

It's an important question for those operating in an international environment. Most tools in the electronic discovery industry are driven by the US market and so, understandably, are US-centric. For concept searching tools this means that they tend to strongly favour the English language.

But what if you're dealing with a case in Switzerland or Mexico? Will concept searching really work for documents in Spanish?

It will depend on the tool and the operator.

Concept searching tools depend on how words relate to each other rather than on specific keywords.

Example: How the term “house” relates to “buy”, “pricing”, “client”, or “sale” within a single document could well be more important than the appearance of “estate agent” or “realtor” when identifying documents relating to house sales.
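To make the example concrete, here is a minimal sketch in Python. It is not how any commercial tool works; the term list and the `concept_score` function are assumptions made up for illustration. It scores a document by how many of the related terms appear together in it, rather than by whether one specific keyword is present.

```python
import re

# Hypothetical concept: terms that tend to appear together around house sales.
HOUSE_SALE_TERMS = {"house", "buy", "pricing", "client", "sale"}

def concept_score(text, concept_terms=HOUSE_SALE_TERMS):
    """Return the fraction of the concept terms that occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & concept_terms) / len(concept_terms)

docs = {
    "a": "The client wants to buy the house; pricing is agreed and the sale closes Friday.",
    "b": "Our estate agent is on holiday until Monday.",
}

# Score each document against the concept.
scores = {name: concept_score(text) for name, text in docs.items()}
for name, score in scores.items():
    print(name, score)
```

Document "a" never mentions an estate agent, yet it scores highly because the related terms occur together. Document "b" contains the obvious keyword but none of the related terms.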

Relating documents in this way is hard enough, but doing this in multiple languages is even harder.

Some tools, such as Attenex or Recommind, are “pre-programmed” with languages. They “understand” what the terms “house” and “sale” mean. So if both words appear in the same document with a certain frequency, that document will be grouped with other similar documents. “Similar” documents could be files that include the words “apartment” and “sale”, or “housing sales”.

This is impressive enough in any one language; doing it across multiple languages is harder still. For tools like Attenex this is achieved through brute force: adding more and more languages to the application.

However, tools such as ContentAnalyst take an entirely different approach. They are not programmed with a language; instead they start each project completely blank. They need to be taught a language from a data set. This is done on the fly by the users of ContentAnalyst.

This means that as the operator starts loading in data, the tool begins to understand the particular data set. Once loading has been completed, ContentAnalyst builds up a picture of the data, after which concept searching can be conducted. This offers a huge advantage over other tools, as it is completely language independent.

The documents could be in English, Swahili or Elvish; it would make no difference to the tool. The tool looks at how words relate to each other, rather than at the actual words themselves.
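A rough sketch of the idea, in Python, might look like the following. This is not ContentAnalyst's actual method, which is not described here; the functions `learn_cooccurrence` and `related_terms` and the toy Spanish corpus are assumptions for illustration. The code contains no dictionary or grammar. It only counts which words appear in the same documents, so it would behave the same way on any language.

```python
from collections import defaultdict
from itertools import combinations

def learn_cooccurrence(documents):
    """Count how often each pair of distinct terms shares a document."""
    counts = defaultdict(int)
    for doc in documents:
        terms = sorted(set(doc.lower().split()))
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1
    return counts

def related_terms(counts, term):
    """List terms seen alongside `term`, most frequent first, ties by name."""
    related = defaultdict(int)
    for (a, b), n in counts.items():
        if a == term:
            related[b] += n
        elif b == term:
            related[a] += n
    return sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data set; the code has no built-in knowledge of Spanish.
corpus = [
    "casa venta precio cliente",
    "casa venta cliente",
    "apartamento venta precio",
    "reunion lunes oficina",
]

model = learn_cooccurrence(corpus)
top = related_terms(model, "venta")
print(top)
```

Here "venta" ends up related to "casa", "cliente", "precio" and "apartamento" purely because they share documents, while "reunion" never appears alongside it. A real tool would use far more sophisticated statistics, but the inputs are the same: the data set itself, not a language pack.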

This technology is not only language independent, but also industry independent. For example, the term “spam” in an IT company means something very different from the same term in a meat company. As ContentAnalyst learns the language for each project, it will pick up on the differences in the use of language, as well as the languages themselves.

