Electronic Discovery: Using Keywords to Cull?

The previous article, on concept searching and keywords, ended with the following paragraph:

The use of keywords will not only depend on the case, but the tools and the service being used.  Some companies will apply keywords, and then only those documents responsive to the keyword search will ever be available for review. Other companies/services will apply the keyword search, after the data was loaded into a review platform. These two different service offerings can make a huge difference to the legal strategy, the results, and the costs.

Why such a big difference?

The problems are immediately obvious, once the marketing spiel is stripped away.

Option 1: One way to review data

  • Data is collected to create a “Collected Set”.
  • The collected data is culled to create a “Review Set”. Culling could involve keyword searching, date filtering, etc.
  • The Review Set is reviewed, which produces the “Disclose Set”. Methods to search the data during the review include concept searching, keywords, dates, etc.

This method has a couple of problems, and one benefit. The first problem is that key documents could be missed. No matter how well thought out the keywords used to create the review set are, data will be culled, so it is possible (and probable) that documents of interest are culled out along with the huge mass of irrelevant documents. The second is that keywords change and date ranges shift as the understanding of the case develops. Returning to the example of football (used in the previous two articles): if the keyword “football” is used, but it later turns out that the key custodians use the term “footie” rather than football, what will be done then? Re-searching and processing all of the documents that are not in review?
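The football/footie problem can be sketched in a few lines. This is only an illustration of an up-front keyword cull, not any vendor's actual process; the `keyword_cull` function, the document dictionaries and the sample text are all hypothetical.

```python
import re

def keyword_cull(documents, keywords):
    """Split documents into (review_set, culled_out) using whole-word keyword matches."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    review_set, culled_out = [], []
    for doc in documents:
        if pattern.search(doc["text"]):
            review_set.append(doc)
        else:
            culled_out.append(doc)
    return review_set, culled_out

# Hypothetical collected set, for illustration only.
collected_set = [
    {"id": 1, "text": "Tickets for the football on Saturday are booked."},
    {"id": 2, "text": "Fancy a game of footie after work?"},
    {"id": 3, "text": "Quarterly invoice attached."},
]

# Option 1: cull before anything reaches the review platform.
review_set, culled_out = keyword_cull(collected_set, ["football"])
review_ids = [d["id"] for d in review_set]
culled_ids = [d["id"] for d in culled_out]
```

Document 2 never reaches the reviewers, even though it is exactly the kind of document they would want to see. Recovering it means going back to the culled data and repeating the search.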

The benefit of this method (initial culling with keywords) is simply cost, and nothing else.

Most professional vendors will allow data to be moved from the Collected Set into the Review Set as the case changes, but there will often be an additional cost to this, possibly a significant one. There will be time delays, processing and loading fees, etc., and these costs and delays can discourage the movement of data into the review set.

Option 2 – A subtle difference.

The idea is simple: load all of the data into the review platform, everything except for the files that would never be reviewed in that case (e.g. movie files, MP3s, etc.). All of the culling can then be done within the review platform. This way, no decision is made early on about what is relevant or which keywords are to be used. All that is removed is the absolute junk.
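As a rough sketch, removing only the "absolute junk" could be as simple as a file-extension filter. The extension list and the `split_junk` helper below are assumptions made for illustration; what counts as junk would be decided case by case.

```python
from pathlib import PurePath

# Hypothetical list of file types that would never be reviewed in this case.
JUNK_EXTENSIONS = {".mp3", ".mp4", ".avi", ".mov"}

def split_junk(filenames):
    """Return (keep, junk): only files with a junk extension are excluded from loading."""
    keep, junk = [], []
    for name in filenames:
        suffix = PurePath(name).suffix.lower()
        if suffix in JUNK_EXTENSIONS:
            junk.append(name)
        else:
            keep.append(name)
    return keep, junk

keep, junk = split_junk(["contract.docx", "holiday.MP3", "mail.pst", "clip.mov"])
```

Note that nothing here looks at the content of a file, so no judgement about relevance is being made at this stage.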

Data for review can then be moved into an appropriate folder/bucket/location for review, using whatever means are required. This “review set” can then be searched, filtered, concept searched, near de-duped, etc., by the reviewers as normal, i.e. the reviewers will see only the data they need to see, but everything is in the review platform. If/when new criteria are identified by the reviewers (dates, words, file types, etc.), these can be applied to the entire data set, and the new documents made available for review almost instantly. This is quick to do, as the data is already in the review platform. If it's quick, it's easy, and therefore it should be cheap.
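A minimal sketch of this idea, assuming a toy in-memory "platform": every document is loaded, and the review set is just a set of document ids that grows as new terms are applied. The `ReviewPlatform` class and its `promote_matching` method are hypothetical names, not the API of any real review tool.

```python
import re

class ReviewPlatform:
    """Toy model: all documents are loaded; the review set is a subset of their ids."""

    def __init__(self, documents):
        self.documents = {d["id"]: d for d in documents}
        self.review_ids = set()

    def promote_matching(self, keywords):
        """Search the ENTIRE loaded data set and add new hits to the review set."""
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
            re.IGNORECASE,
        )
        newly_added = {
            doc_id
            for doc_id, doc in self.documents.items()
            if doc_id not in self.review_ids and pattern.search(doc["text"])
        }
        self.review_ids |= newly_added
        return sorted(newly_added)

# Hypothetical data, for illustration only.
platform = ReviewPlatform([
    {"id": 1, "text": "Tickets for the football on Saturday are booked."},
    {"id": 2, "text": "Fancy a game of footie after work?"},
    {"id": 3, "text": "Quarterly invoice attached."},
])

first_pass = platform.promote_matching(["football"])  # initial term
second_pass = platform.promote_matching(["footie"])   # term discovered during review
```

Because the second search runs against data that is already loaded, adding “footie” is just another query, with no re-processing or re-loading step.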

Vendors can’t process that much data!

The obvious objection to Option 2 is that “vendors can’t process all of the data for a case, hence the initial cull”. But this is not entirely true, depending on the nature of the cull. For example, if the data is to be culled by keyword, then the data must be processed in the first place: the file must be opened, the text extracted, and then the document searched. Therefore much or all of the processing work has already been done.
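The point can be illustrated with a small sketch. The `extract_text` function below is a stand-in for real text extraction (which would have to handle email containers, office documents and so on), and the file names and contents are made up.

```python
def extract_text(raw_bytes):
    """Stand-in for real text extraction; assumes plain UTF-8 content."""
    return raw_bytes.decode("utf-8", errors="replace")

def cull_by_keyword(raw_files, keyword):
    """Keyword cull: every file must be opened and extracted before it can be searched."""
    extracted = {}
    hits = []
    for name, raw in raw_files.items():
        text = extract_text(raw)  # done for every file, whether it hits or not
        extracted[name] = text
        if keyword.lower() in text.lower():
            hits.append(name)
    return hits, extracted

# Hypothetical collected files, for illustration only.
raw_files = {
    "a.txt": b"Football tickets are booked.",
    "b.txt": b"Quarterly invoice attached.",
}
hits, extracted = cull_by_keyword(raw_files, "football")
```

Only one file is a hit, but text has been extracted from both. In this sketch the extraction work for the culled file has been paid for and then thrown away.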

There are reasons that this has not been done historically. One of these is TIFFing. Data often needed to be TIFFed prior to loading into a review platform, which is very time consuming and resource intensive. Thankfully, this is no longer required by many review platforms, which removes many of the problems seen previously. Another reason is storage space: high quality data storage was not cheap, so storing 100 GB of data in a data centre was expensive, simply because of the hardware involved. Hardware prices have since come down (in line with Moore’s law). This may still be a problem for the giant, multi-terabyte cases, but for the 10s or 100s of GB it is less likely to be an issue.

There will be occasions when Option 1 is the better method, or when neither Option 1 nor Option 2 is suitable and an entirely different approach is required. But, generally, it’s worth considering all options and not letting the legal strategy for review be led by a rigid processing/conveyor belt pricing structure.

Really, the question is not whether you cull data with keywords, but at what stage that should happen.
