Keyword selection and keyword lists: until recently these were subjects that lawyers and other clients of electronic discovery vendors felt pretty confident about. After a decade of vendors and consultants explaining how the careful application of keywords can find key documents and remove junk data, that is not surprising.
Until 2008/2009 it would be a rare e-discovery sales pitch that explained the negatives of keyword searching. But after the discussion of the DigiCel case, and the growth and marketing of concept searching technology, the perceived knowledge of keywords is now being challenged.
There are some major pitfalls in the use of keyword searching, and these are covered below. Most of these relate to poor advice or poor communication by vendors and consultants. The benefits of keyword searching are well known, so are not covered here.
Searching in Forensics and Electronic Discovery
If a lawyer gives an electronic discovery vendor a list of keywords, they expect to receive back every document that contains one of those keywords, plus the other documents in the same “family”, e.g. the attachments of emails. That’s a reasonable assumption.
If a lawyer gives a computer forensics vendor the same list of keywords, they expect the same.
In both cases the client may not get what they wanted, for technical reasons, but be completely unaware of this. Here is why:
Electronic Discovery Searches
Electronic Discovery searches consist of searching files and emails. What is not included in these searches are files that are deleted, lost, fragmented, corrupted, or encrypted. If a file is encrypted it is not possible to search it without decrypting it first. Therefore you cannot know the files that are most relevant for decryption (based on a keyword list) until after they have been decrypted (it’s a bit of a chicken and egg scenario).
Some companies don’t decrypt documents unless they are asked to; others do. As long as the legal team knows what is happening it should not matter, but how transparent is that communication?
Equally, if a file is corrupted or damaged it will be skipped, and thrown up as an error during the ED processing. This is the nature of ED processing: it has lots of errors and lots of files are skipped.
Another problem is that keyword searches in electronic discovery will skip certain file types, because those files cannot be interpreted by a review platform. Forensic technology such as EnCase is required to analyze those files.
For example, there is a file called INDEX.DAT on every Windows PC, and this can reveal a history of Google searches. It’s not unheard of to find people who have searched for subjects such as:
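The point about skipped files can be illustrated with a minimal sketch in Python. It is not how any particular ED platform works: the function name `search_documents` is hypothetical, and bytes that cannot be decoded as text are used here as a stand-in for a corrupted or encrypted file. What matters is that the search returns two things, the hits and the list of documents it could not search, so that the second list can be passed on to the legal team.

```python
def search_documents(documents, keywords):
    """Search in-memory documents for keywords.

    documents maps a document name to its raw bytes.
    Returns (hits, skipped): hits maps a name to the keywords found in it,
    skipped lists (name, reason) for every document that could not be searched.
    """
    hits, skipped = {}, []
    for name, raw in documents.items():
        try:
            # Assumption: anything that is not valid UTF-8 text stands in for
            # a corrupted or encrypted file that the tool cannot interpret.
            text = raw.decode("utf-8").lower()
        except UnicodeDecodeError:
            skipped.append((name, "could not be read as text"))
            continue
        found = [k for k in keywords if k.lower() in text]
        if found:
            hits[name] = found
    return hits, skipped
```

A document in the skipped list is not a document without hits; it is a document that was never searched, which is the distinction the client needs to be told about.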
- “wipe data”
- “How to wipe a hard drive”
- “Evidence removal”
- “Evidence deletion”
If a case is related to fraud, or there is a concern that evidence has been deliberately lost or wiped, this could be important. But the ED searches, even if the correct words were used, would not find this evidence, whereas forensic searches should.
With the issues of corrupted, lost, deleted, encrypted, and skipped files, it should be considered that some keywords are more equal than others.
For example there may be a list of keywords given to the vendor/consultants to process, some of which may be generic words; others may be very specific project names or bank account numbers. For this reason it may be worth considering a forensic approach to looking for the specific, priority, keywords.
This type of request will probably be granted by the vendor, though it would be through gritted teeth, as it deviates from the process, from the conveyor belt of ED processing. That does not mean that it’s not a good idea, or not in the best interests of justice.
There are a couple of buyers of electronic discovery services in the UK who consistently make “non-standard” demands of the ED vendors. What those individuals ask for is hard, is difficult, and doesn’t follow the standard process. It can be quite annoying to get the data into that format or conduct those searches. But it’s right. You can’t deny the logic of their requests.
Computer Forensics Searches
Computer forensics vendors vary hugely in scale and quality, perhaps more so than ED companies. One possible reason for this is the cost of set-up. To start an electronic discovery company involves considerable capital, software, hardware and staff. To set up a computer forensics company involves the purchase of EnCase, for around £2000, and that’s about it.
In addition to the cost barriers, computer forensics training and background can differ greatly (particularly in the UK) from that of electronic discovery staff. The net result is that the same request can be treated differently by a pure ED company and by a pure CF company.
It should be emphasized that there are many truly gifted forensics people out there, who can pull together lost, hidden, or corrupted data on a shoestring, far better than a multinational company can with millions of dollars at its disposal.
If a keyword search is given to a company that uses EnCase, the most popular forensics tool in the world, as its primary searching tool, the client may not get quite what they wanted, particularly if they are used to electronic discovery companies and their deliveries.
EnCase (currently) approaches keyword searches in a different way to ED tools, and can therefore produce a far higher false hit rate. For example, if the name “eric” is searched for across the entire hard drive, the “false” hits would include any longer word that happens to contain those letters, such as “America”.
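The “eric” problem comes down to the difference between a raw substring match and a whole-word match. The sketch below uses Python’s standard `re` module purely as an illustration; it is not the search syntax of EnCase or of any ED tool, and the function names and the sample sentence are made up for the example.

```python
import re

def substring_hits(text, keyword):
    """Offsets of every occurrence of keyword, even inside longer words."""
    pattern = re.escape(keyword)
    return [m.start() for m in re.finditer(pattern, text, re.IGNORECASE)]

def whole_word_hits(text, keyword):
    """Offsets where keyword appears as a word of its own.

    \\b is a word boundary, so "eric" no longer matches inside "America".
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return [m.start() for m in re.finditer(pattern, text, re.IGNORECASE)]

# Hypothetical sample text: one genuine hit and two false ones.
sample = "Eric flew to America for a generic meeting."
```

On the sample sentence the substring search reports three hits (“Eric”, “America”, “generic”), while the whole-word search reports only the first. On a full hard drive the same ratio can turn a handful of genuine hits into thousands of false ones.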
Secondly, there could be “hits” in obscure places, such as “file slack”. Sometimes, when dealing with a very technical investigation, file slack is important, but on a mass scale it will appear as junk. Some firms (a minority) will produce this data to a client as though it’s the same quality as a Word document or an email. This will effectively give the client data that they cannot see, read, or correctly interpret.
The next issue is that EnCase does not handle corporate emails very well. EnCase is brilliant at emails like AOL, Hotmail, Yahoo!, etc., but it’s not so good with Outlook or Lotus Notes files, where the vast majority of corporate communication data is stored. [Version 6 of EnCase is better, but it still does not maintain “families” of documents during export.] This could mean that keyword hits in documents are missed because they are inside emails, or that a critical attachment is not produced.
Finally, EnCase does not have de-duplication tools comparable to those of an ED vendor. EnCase can remove e-files that are identical, but not emails. This, depending on the scale of data, can cause problems.
These issues may come as a surprise to some, as EnCase is often considered the gold standard of investigations. It’s not that there is anything wrong with EnCase; it’s just that sometimes it’s not the right tool for the job. If you want an investigation into a computer it is one of the best tools on the market, but in its standard release it’s not ideal for electronic discovery.
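To show why files and emails are different de-duplication problems, here is a minimal sketch in Python. It is an assumption-laden illustration, not a description of how EnCase or any ED vendor does it: identical files can be matched on a hash of their content, whereas copies of the same email held in different mailboxes are rarely byte-for-byte identical, so the sketch hashes a normalized set of fields instead. The function names and the choice of fields are hypothetical.

```python
import hashlib

def file_key(content: bytes) -> str:
    """Identical files produce identical hashes of their raw content."""
    return hashlib.sha256(content).hexdigest()

def email_key(sender, recipients, subject, body) -> str:
    """Hash selected, normalized fields rather than the raw stored message.

    Assumption: two copies are 'the same email' if sender, recipient set,
    subject and body text agree after trimming whitespace and lowercasing
    the addresses.
    """
    parts = [
        sender.strip().lower(),
        ",".join(sorted(r.strip().lower() for r in recipients)),
        subject.strip(),
        body.strip(),
    ]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()

def deduplicate(items, key_func):
    """Keep the first item seen for each key, preserving order."""
    seen, unique = set(), []
    for item in items:
        key = key_func(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

Which fields go into the email key is a judgment call that changes what counts as a duplicate, so it is worth asking a vendor which ones they use.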
Tool selection and tool usage by a vendor/consultant is not something that a lawyer will ever get into, yet the tools being used can radically affect the accuracy, quality, and nature of data they receive from a keyword list.
None of this means that these issues will occur, only that they can occur.
Keyword Selection: To Advise or Not To Advise
First it was the technical bit, now the advice part. Opinions on this section will depend on an individual’s background and experience.
This author strongly believes that it is the role of the consultant to advise on keyword selection; in fact that’s what clients are paying good money for: advice. Others may believe that it’s not the role of the techies to advise the lawyers.
Example 1 – Electronic Discovery Searches
A client provides an ED consultant with a chunk of data and a list of keywords. The list is shown below.
- Thomas Jones
- William Banner
- David Beckham
The ED consultant has two options in front of him: process the data with the list as it’s given to him, or ask the purpose of the list. If the consultant chooses the latter, and the client states “it’s because we have to, it’s a court order”, then the company can process the data without further discussion.
Alternatively, if the consultant finds out that the keyword list was chosen by the client, rather than the court, because the client is trying to find all communications with these individuals, then the consultant can work with the client to improve the keyword list.
In this case a few questions put to the client may reveal that:
- Mr. Banner and Mr. Jones work for Acme
- Mr. Thomas Jones has the email address email@example.com, is referred to as Tom, and has the user name tjones23
- Mr. William Banner has an email address of firstname.lastname@example.org, is referred to as Bill by his colleagues and friends, and has a user name of wbanner.
- Mr. David Beckham, however, does not work for Acme, but did communicate with Mr. Jones and Mr. Banner via his Hotmail address email@example.com. Tom and William referred to him as “Dave”, not David.
This would mean that a new, improved keyword list could be:
- Thomas Jones
- William Banner
- David Beckham
- Bill Banner
“Bill” should be considered as a keyword, but may produce a high false hit rate. Therefore it should be tested to determine the number of files it does produce and the benefits of using it.
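The two steps described here, expanding the names into variants and then testing how many documents each candidate keyword pulls back, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `expand_names` and `count_hits` are hypothetical helpers, the data structure is invented, and a real tool would count hits against its own index rather than a list of strings.

```python
import re
from collections import Counter

def expand_names(people):
    """Build keyword variants from what the client knows about each person.

    Each person is a dict with 'first' and 'last', plus optional
    'nicknames' and 'usernames' lists.
    """
    keywords = []
    for person in people:
        for first in [person["first"]] + person.get("nicknames", []):
            keywords.append(f"{first} {person['last']}")
        keywords.extend(person.get("usernames", []))
    # Remove repeats while keeping the original order.
    return list(dict.fromkeys(keywords))

def count_hits(documents, keywords):
    """Count how many documents contain each keyword as a whole word/phrase."""
    counts = Counter()
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        counts[keyword] = sum(1 for doc in documents if pattern.search(doc))
    return counts
```

Running `count_hits` on a sample of the data with both “Bill Banner” and “Bill” gives a quick feel for how much extra review a broad keyword would cost, before it is applied to the whole data set.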
As people rarely refer to each other by their full name e.g. “David Beckham” (other than in formal documents) those keywords/phrases are not often useful alone; but it does not mean that they should not be included as part of an improved keyword list.
The examples reflect real cases where the client has asked that the full names were searched for. As this has happened on several cases, by different firms, over several years, it’s reasonable to assume that this is not an isolated problem.
If a law firm asks for these types of searches to be done, and the consultants do not ask why they are being used, then the consultants have failed their client. Advisors should not wait for their clients to ask for advice; they should be pro-active in giving it.
Example 2 – Computer Forensic Searches
The problem below is a more technical issue, and one that different companies will have different approaches to.
When clients have had their data stolen, sometimes they will simply pass the suspect’s hard drive to the relevant company for investigation, along with the obligatory keyword list, and ask that the keywords are looked for.
If the client insists that these searches are done, and nothing more, then the forensics company has done all it can to provide a quality service. But if the computer forensics consultants just run the keywords without offering advice to the client, then they have again failed their client.
Data theft investigations, despite their frequency, are complicated, hard to do, and often produce inconclusive evidence. Often the investigator is looking for information about files moving from computer A to storage media C, via computer B. The investigator only has access to computer B, and the storage media C has long since gone (or may never have existed). As such the “stolen” data may never have actually resided on the computer being investigated.
That does not mean that the investigation cannot be completed or successful; only that keywords are quite possibly useless.
The key evidence could be:
A file path that is encrypted in the registry; a file name stored as Unicode in a Link file; a USB connection time in the SETUPAPI.LOG; or other similarly obscure-sounding places. This is where the true skill of the forensic investigator comes in. Keywords may help, but they are unlikely to provide the solution (in this case).
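The Unicode point is a good example of why a plain keyword can miss evidence that is sitting in front of it. The sketch below, in Python, searches raw bytes for a keyword under two encodings. It assumes that “Unicode” here means UTF-16 little-endian, as Windows commonly uses; the function name and the sample bytes are hypothetical, and the match is case-sensitive to keep the example short.

```python
def find_keyword_bytes(data: bytes, keyword: str):
    """Return the byte offsets of keyword under each encoding tried.

    A file name stored as UTF-16LE has a zero byte after every ASCII
    character, so a search for the plain single-byte form will not find it.
    """
    results = {}
    for encoding in ("utf-8", "utf-16-le"):
        needle = keyword.encode(encoding)
        offsets = []
        start = data.find(needle)
        while start != -1:
            offsets.append(start)
            start = data.find(needle, start + 1)
        results[encoding] = offsets
    return results

# Hypothetical fragment of raw data containing a UTF-16LE file name.
fragment = b"\x00\x01" + "budget.xls".encode("utf-16-le") + b"\x00"
```

On this fragment the single-byte search for “budget” finds nothing, while the UTF-16LE search finds it at offset 2. A keyword list handed over without this kind of technical input will only ever be run in the form it was typed.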
Some companies (not all, and hopefully the minority) will approach data theft and other similar cases with the same methodology each time. Get a keyword list from the client, run it over the data, give the results to the client, and let them find the evidence they are looking for, a bit like an electronic discovery review.
Other forensic companies will take on a case from a law firm, ask their client what their problem is, then look for solutions and choose the best method(s) for that particular case; one that may not even involve keyword searches.
The approach of each company is not dependent on its size, or the price you pay.
Keyword searches can be very beneficial and will be for the foreseeable future. But a deeper understanding of the effects of a particular keyword search, with a particular technology, should be sought before applying a list of words to a chunk of data at the beginning of a case. The responsibility to educate the client on the pros and cons of different methodologies for that particular case rests with the electronic discovery vendors.
This is important because tool selection, tool usage, and technical advice early in the case radically affect the results a client will get and how the documents are culled down.
That is not to say that law firms have to understand the difference between an EnCase GREP search and an FTK indexed search. Rather, their requirements and expectations for a search should be given to an electronic discovery company, and that company should work to develop a search that meets those criteria as closely as possible, and inform the client of the known failings of the methods it is using [all methods have some flaws].
As mentioned before, it is the law firm who pays the price for decisions on behalf of their client in court. Even if they should not be expected to understand the fine detail of forensic tool selection or the detailed variables of keyword selection, they are the ones who have to face the judge. It is the role of the ED/CF consultant to understand the technical wizardry, and convey that to their client, warts and all.
The only real conclusion from this is that the clients of electronic discovery need to approach each case with caution, and recognize that, perhaps, one company does not fit all scenarios.
 EnCase 6 has an “indexing” function to stop this, but it does not work effectively, and even Guidance, the makers of EnCase, accept that it will not work correctly until Version 7. There are advanced GREP searches that can be done to limit the problem of finding “eric” in “amERICa”; however, they are not always done. This problem does not exist if tools like DTSearch or FTK are used.
 It has a bigger brother, called EnCase Enterprise, which is far more capable. This is several orders of magnitude more expensive than the standard EnCase.
 The use of the email addresses as keywords may not be useful, depending on the data and the tools being used. If the data being looked at is the ACME emails, those searches will pull back every email from Jones and Banner. A better search could be to look for communication between the individuals, building up searches based on the TO, FROM, CC and BCC fields, as well as looking in the body of the email for the Hotmail address, to pick up emails that have been forwarded from one of the parties.
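A field-based search of this kind can be sketched with Python’s standard `email` package. This is an illustration only: it works on a single raw message held as text, the function names are hypothetical, and so are the addresses used to try it out (for instance tjones@acme.example). Real Outlook or Lotus Notes stores would first need to be converted by a tool that understands those formats.

```python
from email import message_from_string
from email.utils import getaddresses

def _addresses(msg, *headers):
    """Collect the lowercased addresses found in the named header fields."""
    values = []
    for header in headers:
        values.extend(msg.get_all(header, []))
    return {addr.lower() for _name, addr in getaddresses(values) if addr}

def is_between(raw_message, group_a, group_b):
    """True if the message was sent from one group to the other."""
    msg = message_from_string(raw_message)
    senders = _addresses(msg, "From")
    recipients = _addresses(msg, "To", "Cc", "Bcc")
    a = {x.lower() for x in group_a}
    b = {x.lower() for x in group_b}
    return bool((senders & a and recipients & b) or
                (senders & b and recipients & a))

def body_mentions(raw_message, address):
    """True if the address appears in the body, e.g. in a forwarded email."""
    msg = message_from_string(raw_message)
    parts = [p for p in msg.walk() if not p.is_multipart()]
    text = "\n".join(str(p.get_payload()) for p in parts)
    return address.lower() in text.lower()
```

Combining the two checks picks up direct communication between the parties as well as messages forwarded on internally, without pulling back every email that Jones or Banner ever sent.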