Electronic Discovery: Using Keywords to Cull?

The previous article, on concept searching and keywords, ended with the following paragraph:

The use of keywords will not only depend on the case, but the tools and the service being used.  Some companies will apply keywords, and then only those documents responsive to the keyword search will ever be available for review. Other companies/services will apply the keyword search, after the data was loaded into a review platform. These two different service offerings can make a huge difference to the legal strategy, the results, and the costs.

Why such a big difference?

The problems are immediately obvious, once the marketing spiel is stripped away.

Option 1: Cull the data before review

  • Data is collected to create a “Collected Set”
  • Collected data is culled to create a “Review Set”. Culling could involve keyword searching, date filtering, etc
  • The Review Set is reviewed, and this produces the “Disclose Set”. Methods to search the data during the review include concept searching, keywords, dates, etc.

This method has a couple of problems, and one benefit. The first problem is that key documents could be missed. No matter how well thought out the keywords used to create the review set are, data will be culled, so it’s possible (and probable) that documents of interest are culled out along with the huge mass of other documents. The second is that keywords change and date ranges shift as the understanding of the case develops. Returning to the example of football (used in the previous two articles): if the keyword “football” is used, but it later turns out that the key custodians use the term “footie” rather than football, what will be done then? Re-searching and processing of all of the documents that are not in review?
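The “footie” problem can be shown with a minimal sketch. The document ids, the sample text and the `cull` function are made up for illustration, and a real processing tool will do far more than a whole-word match.

```python
import re

# Hypothetical collected set: document id -> extracted text.
collected_set = {
    "doc1": "Do you want to watch football tonight?",
    "doc2": "Fancy a game of footie after work?",
    "doc3": "Please find the quarterly report attached.",
}

def cull(documents, keywords):
    """Return the ids of documents containing any keyword as a whole word."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return {doc_id for doc_id, text in documents.items() if pattern.search(text)}

# Only documents responsive to the keyword ever reach the review platform.
review_set = cull(collected_set, ["football"])
culled_out = set(collected_set) - review_set

print(sorted(review_set))  # ['doc1']
print(sorted(culled_out))  # ['doc2', 'doc3']
```

The “footie” email (doc2) never reaches the reviewers, and nothing inside the review platform will reveal that it exists.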

The benefit of this method (initial culling with keywords) is simply cost, and nothing else.

Most professional vendors will allow data to be moved from the Collected Set into the Review Set as the case changes, but there will often be an additional cost to this, possibly a significant one. There will be time delays, processing and loading fees, etc, and these costs and delays can discourage the movement of data into the review set.

Option 2 – A subtle difference.

The idea is simple: load all of the data into the review platform, everything except for the files that would never be reviewed in that case (e.g. movie files, MP3s, etc). All of the culling can then be done within the review platform. This way a decision is not made early on about what is relevant, or what keywords are to be used. All that is removed is the absolute junk.

Data for review can then be moved into an appropriate folder/bucket/location for review, using whatever means required. This “review set” can then be searched, filtered, concept searched, near de-duped etc, by the reviewers as normal. I.e. the reviewers will see only the data they need to see, but everything is in the review platform. If/when new terms are identified by the reviewers (dates, words, file types, etc) these can be applied to the entire data set, and the new documents almost instantly made available for review. This is quick to do, as the data is already in the review platform. If it’s quick it’s easy, and therefore should be cheap.
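As a rough sketch of the same situation under Option 2, assume everything is already loaded and the “review folder” is just a set of document ids. The names `loaded`, `search` and `review_folder` are hypothetical and do not correspond to any particular review platform.

```python
# Hypothetical data set, already loaded into the review platform.
loaded = {
    "doc1": "Do you want to watch football tonight?",
    "doc2": "Fancy a game of footie after work?",
    "doc3": "Please find the quarterly report attached.",
}

def search(documents, term):
    """Case-insensitive substring search over every loaded document."""
    return {
        doc_id
        for doc_id, text in documents.items()
        if term.lower() in text.lower()
    }

# Initial review folder, built from the first keyword.
review_folder = search(loaded, "football")

# Later the reviewers learn that the custodians say "footie".
# The new term is run against the whole loaded set, not just the folder.
newly_found = search(loaded, "footie") - review_folder
review_folder |= newly_found

print(sorted(newly_found))    # ['doc2']
print(sorted(review_folder))  # ['doc1', 'doc2']
```

No re-processing or re-loading is needed here, which is the point of the approach: the new term is just another search.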

Vendors can’t process that much data!

The obvious argument against Option 2 is that “vendors can’t process all of the data for a case, hence the initial cull”. But this is not entirely true, depending on the nature of the cull. For example, if the data is to be culled by a keyword then the data must be processed in the first place. The file must be opened, the text extracted, and then the document searched. Therefore much/all of the processing work has already been done.
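To illustrate why the keyword cull already implies most of the processing, here is a simplified sketch in which text is extracted once into a word index and any later keyword is just a lookup. The `extract_text` function is a stand-in for the expensive step (real tools handle many file formats), and the file names and contents are invented.

```python
import re
from collections import defaultdict

def extract_text(raw_file):
    """Stand-in for the expensive step: opening a file and extracting its text."""
    return raw_file["body"]

def build_index(files):
    """Process every file once and record which files contain which words."""
    index = defaultdict(set)
    for file_id, raw in files.items():
        for word in re.findall(r"[a-z0-9]+", extract_text(raw).lower()):
            index[word].add(file_id)
    return index

# Hypothetical collected files.
files = {
    "a.msg": {"body": "Watch football tonight?"},
    "b.msg": {"body": "Fancy a game of footie?"},
}

index = build_index(files)

# Once the index exists, each keyword is a cheap lookup.
print(sorted(index.get("football", set())))  # ['a.msg']
print(sorted(index.get("footie", set())))    # ['b.msg']
```

The cost sits in `build_index`, which has to run whether the cull happens before or after loading.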

There are reasons that this has not been done, historically. One of these is TIFFing: data often needed to be TIFFed prior to loading into a review platform, which is very time consuming and resource intensive. Thankfully, this is no longer required by many review platforms, which takes away many of the problems that have been seen previously. Another reason is storage space: high quality data storage was not cheap, so storing 100 GB of data in a data centre was expensive, simply because of the hardware involved. Hardware prices have since come down (in line with Moore’s law). This may still be a problem for the giant, multi-terabyte cases, but for the 10s or 100s of GB this is less likely to be an issue.

There will be occasions when Option 1 will be a better method, or when neither Option 1 nor Option 2 is suitable and an entirely different approach is required. But, generally, it’s worth considering all options and not letting the legal strategy for review be led by a rigid processing/conveyor belt pricing structure.

Really the question is not whether you cull data with keywords, but at what stage that should happen.

Electronic Discovery: Concept Searching & Keyword Searching

Keyword Searching & Concept Searching

In the wake of the DigiCel case, and the constant strides in concept searching (not to mention the hefty marketing budget spent on advertising the tools), is keyword searching a thing of the past?

The preceding article, which discusses concept searching, used an example of where concept searching would trump keyword searching. The same example is discussed below.

Example 1:

“Do you want to watch football tonight, I bet Chelsea scores?”

Example 2:

“Do you want to watch the game tonight, I bet Chelsea scores?”

If the term “football” was used as a keyword, then Example 1 would be found, but Example 2 would not. But they are both clearly about football. A concept searching tool would be expected to “find” both examples, because “Concept Searching” is, as the name implies, actually looking for “concepts” rather than just keywords.

If a keyword search was to be conducted, then the keyword search list could be expanded to include “game”, but then the following sentence would not be found:

  • Are you going to see the match tonight?

Therefore the word “match” could be added. But what about the sentence:

  • Fancy a game of footy tonight?

There are, of course, numerous combinations in which a football match can be talked about without mentioning the term “football”.

If a keyword search list was going to be used to identify football, then the keyword list would have to be quite comprehensive, and include the following terms.

  • Football
  • Soccer
  • Match
  • Score
  • Game
  • Footy
  • Five a side
  • 5 a side
  • Players
  • League
  • Cup

This list is not guaranteed to bring back every email and document about football, but it is guaranteed to bring back a lot of non-relevant material. It is immediately obvious that the terms “cup” and “match” could have lots of meanings other than football. If the document set is a couple of hundred thousand documents, or even a couple of million, then the number of false positives produced will be huge. In this case the keyword search list would not be very effective.
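The effect of a broad keyword list can be put into numbers with a toy example. The seven sentences and their relevance labels below are invented for illustration, and the keyword list is a subset of the one above; on a real document set the proportions will differ, but the pattern of false positives alongside missed documents is the same.

```python
import re

# Hypothetical documents, each labelled by hand: is it really about football?
documents = [
    ("Do you want to watch football tonight?", True),
    ("Are you going to see the match tonight?", True),
    ("Fancy a kickabout at lunch?", True),
    ("The figures do not match the invoice.", False),
    ("Can I get you a cup of tea?", False),
    ("The league table of suppliers is attached.", False),
    ("Minutes of the board meeting are attached.", False),
]

keywords = ["football", "soccer", "match", "score", "game",
            "footy", "players", "league", "cup"]
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    re.IGNORECASE,
)

hits = [(text, relevant) for text, relevant in documents if pattern.search(text)]
true_positives = sum(1 for _, relevant in hits if relevant)
false_positives = len(hits) - true_positives
missed = sum(
    1 for text, relevant in documents if relevant and not pattern.search(text)
)

precision = true_positives / len(hits)
recall = true_positives / (true_positives + missed)

print(false_positives, missed)  # 3 1
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.40 recall=0.67
```

Even with nine keywords, more than half of the hits are junk (“cup of tea”, “league table”), and the “kickabout” email is still missed.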

Concept searching  could cluster the documents of a similar nature together and therefore help to find the football related emails.

In this trivial example we can see the failings of keyword searching and the benefits of concept searching, but that does not mean that concept searching has replaced keyword searching; far from it.

The use of keyword searching in a concept based world

Even if concept searching is used, reviewers will still need to find documents within clusters. Going back to the football example, once the documents have been clustered, a keyword search for “football” could help identify the relevant cluster.

Equally it may be that there are so many documents to start with that an initial cull of documents, with a very broad keyword search, could be justified before going into a concept based review. Taking a document set from 10 million to 1 million documents could be seen as an economical way to approach a review. The reverse is also true, where a concept search could be conducted and then a keyword search applied, to help focus in on the core documents quickly.

The use of keywords will not only depend on the case, but the tools and the service being used.  Some companies will apply keywords, and then only those documents responsive to the keyword search will ever be available for review. Other companies/services will apply the keyword search, after the data was loaded into a review platform. These two different service offerings can make a huge difference to the legal strategy, the results, and the costs.

The next article on keywords will discuss how and why.

Electronic Discovery: What is Concept Searching?

There is much discussion about concept searching, its benefits, etc (and there are  quite a few articles on this site about the very subject). But what is it, and how is it deployed?

What is concept searching?

Concept searching is a method of searching files not based on keywords, but on the subject matter of the document, paragraph, or sentence.  This is different to keyword searching which requires an exact keyword hit.

Example 1:

“Do you want to watch football tonight, I bet Chelsea scores?”

Example 2:

“Do you want to watch the game tonight, I bet Chelsea scores?”

If the term “football” was used as a keyword, then Example 1 would be found, but Example 2 would not.  But they are both clearly about football. However, a concept searching tool would be expected to “find” both examples, because “Concept Searching” is, as the name implies, actually looking for “concepts” rather than just keywords.

How is Concept Searching Deployed?

Concept Searching can be deployed in a variety of ways, and under many different names, depending on the vendor/service provider/consultant. We will attempt to cover the most common names, and methodologies, below.

Clustering/Buckets

This is, perhaps, the most commonly known use of concept searching. It takes a set of data and breaks it into groups of “similar documents”. E.g. one group could be documents relating to football, another about meetings, etc. The number and size of the groups would depend on the documents, and the concept searching tools being used. For example 10,000 documents could be broken into 10 groups of 1,000 documents each, or 1,000 groups of 10 documents each. Equally there could be 1 group of 5,000 and 5 groups of 1,000. There are a near infinite number of combinations. The more advanced tools on the market allow the operator a degree of control over the sizing and nature of the groups.

The advantage of these groupings is that they allow the effective focusing of resources. E.g. the groupings that appear to be about football, parties, and other junk material can either be not reviewed, or just scanned quickly. If there are 500 documents in one group, and a sample review of that group shows that they all relate to fantasy football, and all the email titles appear to relate to fantasy football, then it may (case dependent) be reasonable to skip the rest of the group. Equally, if there is a group of 1,000 documents related to “contracts”, the senior reviewers can be dedicated to that cluster, and a detailed review can be conducted early on in the case, rather than the documents going through one or two reviewers before they get reviewed in detail.
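The grouping idea can be sketched in a few lines. This is only a toy: it groups documents by plain word overlap (cosine similarity on word counts), which is a much cruder stand-in for whatever model a commercial concept searching tool uses. The sample documents, the function names and the `threshold` parameter are all assumptions made for illustration.

```python
import math
from collections import Counter

def vectorise(text):
    """Turn a document into a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def cluster(texts, threshold=0.5):
    """Greedy grouping: join the first group whose first document is similar enough."""
    vectors = [vectorise(t) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for group in clusters:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            clusters.append([i])
    return clusters

docs = [
    "fantasy football league team picks",
    "fantasy football team transfer picks",
    "contract renewal terms and signature",
    "signature needed on contract renewal",
]

print(cluster(docs))  # [[0, 1], [2, 3]]
```

The `threshold` plays the role of the operator’s control over sizing: raising it tends to give more, smaller groups, and lowering it gives fewer, larger ones.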

Auto-Tagging/Predictive Marking

This methodology works on the same technology, identification of documents through concepts, but rather than creating multiple groups it will create a couple of groups, or possibly only one. Generally a small sample of documents, which are similar to each other, is provided to search against a very large number of unknown documents. The concept searching tool will search the large number of unknown documents for documents which are “similar” to the known document set. Documents that are found to be similar will then be identified and clustered together. This type of technology can be deployed in several different ways, e.g.

  • Looking for documents which are similar to a document already disclosed
  • Looking for documents that are similar to documents that are known to be junk, e.g. if there is a lot of social networking email traffic, this could be used to identify much of the spam and remove it from the review set
  • If  “hot documents” are found during the initial review, these can be used to identify other similar documents.
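The “find more like these” approach can be sketched as follows. Again this is a toy using word overlap rather than a real concept model; the seed documents, the unknown documents, the 0.5 cut-off and all function names are hypothetical.

```python
import math
from collections import Counter

def vectorise(text):
    """Turn a document into a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Small set of known documents (here: social networking junk).
seeds = [
    "you have a new friend request on the social network",
    "someone commented on your photo on the social network",
]

# Large set of unknown documents (three shown for illustration).
unknown = {
    "u1": "a friend tagged you in a photo on the social network",
    "u2": "draft supply contract attached for signature",
    "u3": "new friend request waiting on the social network",
}

# Combine the seed documents into one profile and score everything against it.
profile = Counter()
for seed in seeds:
    profile.update(vectorise(seed))

scores = {doc_id: cosine(profile, vectorise(text)) for doc_id, text in unknown.items()}
similar = sorted(doc_id for doc_id, score in scores.items() if score >= 0.5)

print(similar)  # ['u1', 'u3']
```

Depending on what the seed set represents, the documents that come back could be tagged as junk and removed from the review set, or pushed to the front of the queue as likely “hot documents”.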

Concept  Searching “words”

Some concept searching tools allow the searching of words or paragraphs. In these circumstances the tool is doing just what is done above, but on a more focused paragraph or sentence rather than an entire document. This is particularly important when dealing with large documents; if there is a 500 page document that is  important or “hot”, because of one paragraph, concept searching for similar documents will often produce junk. In these cases concept searching for a key paragraph may be more effective.
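A small sketch shows why paragraph-level searching can help with long documents. It compares a key paragraph against a whole document and against each of its paragraphs separately, using simple word overlap as a stand-in for a real concept engine. The text and names are invented for illustration.

```python
import math
from collections import Counter

def vectorise(text):
    """Turn a piece of text into a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# The one paragraph that made a long document "hot".
hot_paragraph = "the supplier agreed to backdate the invoice"

# A hypothetical candidate document, split into paragraphs.
paragraphs = [
    "minutes of the weekly meeting and list of attendees",
    "we agreed the supplier will backdate the invoice again",
    "any other business and date of next meeting",
]

query = vectorise(hot_paragraph)
whole_document_score = cosine(query, vectorise(" ".join(paragraphs)))
paragraph_scores = [cosine(query, vectorise(p)) for p in paragraphs]
best_paragraph = max(range(len(paragraphs)), key=lambda i: paragraph_scores[i])

print(best_paragraph)  # 1
print(paragraph_scores[best_paragraph] > whole_document_score)  # True
```

The matching paragraph scores well on its own, but its signal is diluted once the surrounding text is included, and with a 500 page document the dilution would be far greater.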

Keyword Searching, is this still needed?  With the advent of concept searching, should keyword searching still be conducted? This subject will be covered in the next article on concept searching.

Forensics Quizzes are on the move

The following quizzes have now been moved to the new site:

A selection of computer forensics tests and quizzes is available on the parent site, Where Is Your Data?. These include:

Computer Forensics – NTFS (1) (theory)

Computer Forensics – NTFS (2) (theory)

Computer Forensics  – Law (theory)

Who’s Who (in computer forensics and electronic discovery)

The following quizzes have not all been moved across to the new site yet, but are being moved across (all in good time)

Electronic Discovery: Review Platform – eView

The latest version of Integreon’s eView (Version 3) is discussed below.

The review platform has a clean look to it, unlike other platforms, e.g. RingTail, which has very granular detail but can also appear cluttered. In fact eView has an almost “Outlook 2007” look to it, which is almost certainly deliberate as it will mean that lawyers looking for emails will feel comfortable in that environment.

It has all the standard features that are expected in a review platform: searching, de-duplication, etc. It also has message threading, which is increasingly common but still not standard in all review platforms.

Like most review platforms it has an SQL back end, but the system is accessed via Citrix, which is less common in the review platform community. Offerings from RingTail, Relativity, Recommind, iConnect, Documatrix, etc, are all available via a web browser (normally only Internet Explorer). The major exception to this is Attenex, which also works via Citrix.

eView looks like a standard linear review platform, but has Content Analyst, a concept searching tool built in. This means that both a linear and non-linear review can be conducted in the same review platform. This puts eView competing directly with Relativity, which has been very well received on both sides of the Atlantic.

The screen shots below show eView being used to concept search. In the first example a paragraph is highlighted, so that documents that are similar to that paragraph can be searched for.

Integreon eView Screen Shot Concept Searching

Integreon eView Screen Shot Concept Searching (2)
