

Simple Boolean keyword searching is never enough

Many attorneys consider themselves competent to draft effective Boolean search queries and then make the leap of faith to the conclusion that their queries are capturing the vast majority of relevant material. Often the queries are not!

Precision vs. Recall

Keyword advocates fervently hope that when a search achieves high recall (the percentage of all relevant documents that the search actually finds), precision (the percentage of retrieved documents that are actually relevant) will stay high as well. It doesn’t. There is a tradeoff between recall and precision: the better the recall, the lower the precision. Unfortunately, the converse is also often true: better precision often means worse recall.

When the keywords seem to identify many relevant documents, the search seems precise, but with large datasets the search will almost certainly fail to identify many other relevant documents.
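The tradeoff is easier to see with numbers. The following is a minimal sketch in Python, assuming a hypothetical collection of 1,000 documents of which 100 are truly relevant; the document IDs and both searches are invented for illustration and do not come from any real matter.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = relevant hits / all documents retrieved
    recall    = relevant hits / all relevant documents in the collection
    """
    true_hits = retrieved & relevant
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: doc0..doc999, of which doc0..doc99 are relevant.
relevant = {f"doc{i}" for i in range(100)}

# Narrow search: 40 hits, 36 of them relevant.
narrow = {f"doc{i}" for i in range(36)} | {f"doc{i}" for i in range(500, 504)}

# Broad search: 400 hits, 90 of them relevant.
broad = {f"doc{i}" for i in range(90)} | {f"doc{i}" for i in range(500, 810)}

narrow_p, narrow_r = precision_recall(narrow, relevant)
broad_p, broad_r = precision_recall(broad, relevant)

print(f"narrow: precision {narrow_p:.1%}, recall {narrow_r:.1%}")
print(f"broad:  precision {broad_p:.1%}, recall {broad_r:.1%}")
```

In this invented example the narrow search looks impressive because nine of every ten hits are relevant, yet it leaves 64 of the 100 relevant documents behind. The broad search finds 90 of them, but reviewers must read 400 documents to get there.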

Improving keyword searching

While keyword search can be effective in finding relevant documents, it can suffer from both low recall and poor precision.

Conducting broader searches to improve recall comes at the cost of lower precision and requires examining many more irrelevant documents. Total review costs go up accordingly.

Iterating keywords over a series of searches, sampling results and then refining the keyword searches can certainly improve results but only at greatly increased cost.

Commercial and academic predictive analytic systems take a different approach. They rely on computer-implemented algorithms, most of them proprietary trade secrets, that weigh document features through a continuous ranking process in order to reach a desired level of recall while reviewing the fewest possible documents. Lawyers, paralegals, and other human beings are generally limited to identifying and locating the documents from which the searches are built.

Basic preparation for keyword searching

If your client is required to produce documents, particularly ESI (electronically stored information), and you choose to use keywords and metadata features to create or limit a data set, at the very least you must statistically sample the data you are proposing to leave behind. Otherwise you have no viable defense against sanctions for overlooking what may be relevant and material documents.
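One way to picture the sampling step is sketched below in Python. It is an illustration only, not a prescribed protocol: the function names, the sample size of 400, the discard pile of 50,000 documents, and the count of 8 relevant documents found in the sample are all assumptions, and the Wilson score interval is simply one common way to put a range around a sampled proportion.

```python
import math
import random


def sample_discards(discard_ids, sample_size, seed=0):
    """Draw a simple random sample (without replacement) from the discard pile."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced later
    return rng.sample(list(discard_ids), sample_size)


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly 95% confidence."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical numbers: 50,000 documents left behind, 400 sampled,
# and reviewers mark 8 of the sampled documents as relevant.
discard_pile = [f"doc{i}" for i in range(50_000)]
sample = sample_discards(discard_pile, 400)
relevant_in_sample = 8

rate = relevant_in_sample / len(sample)
low, high = wilson_interval(relevant_in_sample, len(sample))

print(f"sampled rate of relevant documents in the discard pile: {rate:.1%}")
print(f"approximate 95% range: {low:.1%} to {high:.1%}")
print(f"implied relevant documents left behind: "
      f"{low * len(discard_pile):.0f} to {high * len(discard_pile):.0f}")
```

Even a small rate in the sample can translate into hundreds of relevant documents in a large discard pile, which is why the sample should be drawn and reviewed before anything is permanently set aside.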

The fatal flaw of Boolean keyword searches

A Boolean search connects keywords with Boolean operators, such as “AND,” “OR,” or “NOT,” but only returns matching documents whose text contains the exact words and phrases provided by the user, within the particular conditions specified. Searchers can only find what they already know to look for, and the “keywords” they search on often miss many relevant documents without ever providing any warning that relevant documents may have been missed.
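The flaw can be shown with a toy matcher. The Python sketch below is an assumption-laden illustration, not how any particular review platform works: the `matches` helper, the three sample messages, and the query are all invented.

```python
import re


def tokens(text):
    """Lower-case the text and split it into a set of whole words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def matches(text, all_of=(), any_of=(), none_of=()):
    """Toy Boolean test: every AND term, at least one OR term, no NOT term."""
    words = tokens(text)
    return (
        all(term in words for term in all_of)
        and (not any_of or any(term in words for term in any_of))
        and not any(term in words for term in none_of)
    )


# Invented messages. Message "b" concerns the same invoice but uses shorthand.
docs = {
    "a": "The invoice for the shipment was approved on Friday.",
    "b": "Pls make sure the inv for that shipmnt gets the usual treatment.",
    "c": "Lunch invoice attached, not related to any shipment.",
}

# Query: invoice AND shipment NOT lunch
hits = [
    doc_id
    for doc_id, text in docs.items()
    if matches(text, all_of=("invoice", "shipment"), none_of=("lunch",))
]

print(hits)
```

The query returns only message "a". Message "b" is silently skipped because it says "inv" and "shipmnt", and nothing in the result tells the searcher that it was missed.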

In the search for appropriate keywords:

  • Examine the complaint and every other document you have that is relevant to the subject matter of the litigation.
  • It is best to start with broad terms. Recall is much more important than precision, so err on the side of over-inclusion.
  • Include word variants or “stems”. Some review platforms have stem search capabilities, but it is best to think of all possible relevant variants and string them together with the Boolean OR operator. Beware of “wildcards,” however, without some kind of preliminary testing for potential overinclusion.
  • Include synonyms for any important terms. Check a good online thesaurus, but for industry-specific terms, check the trade publications, particularly those which have published “style manuals.” Beware of acronyms.
  • Test, revise, and re-test your search terms. First run your searches individually or in small topically related groups. If the results demonstrate a need to revise, change only one element at each retest run. If necessary, use well-established statistical sampling methods with robust randomization.
  • Test random samples from both the document set created from keyword hits and the set of all the documents that have been discarded. Then compare the relevance rates of both sets before making permanent discards.
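The caution about variants and wildcards in the list above can be tested on a small sample before a term is run against the full collection. The Python sketch below uses regular expressions as a stand-in for a review platform's query syntax, which is an assumption; the helper names, the variant list, and the sample sentences are invented for illustration.

```python
import re


def or_query(variants):
    """Whole-word OR of explicitly listed variants, case-insensitive."""
    ordered = sorted(variants, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def wildcard_query(prefix):
    """Rough equivalent of a trailing wildcard such as termin*."""
    return re.compile(rf"\b{re.escape(prefix)}\w*", re.IGNORECASE)


def hit_indexes(pattern, texts):
    """Positions of the sample texts that the pattern matches."""
    return [i for i, text in enumerate(texts) if pattern.search(text)]


# Invented test sample.
sample = [
    "The contractor terminated the agreement.",
    "Notice of termination was sent in March.",
    "The bus terminal is closed for repairs.",
    "Terms were renegotiated last quarter.",
]

explicit = or_query(
    ["terminate", "terminated", "terminates", "terminating", "termination"]
)
wildcard = wildcard_query("termin")

explicit_hits = hit_indexes(explicit, sample)
wildcard_hits = hit_indexes(wildcard, sample)

# Documents the wildcard pulls in that the explicit variant list does not.
extra = sorted(set(wildcard_hits) - set(explicit_hits))
print("explicit:", explicit_hits, "wildcard:", wildcard_hits, "extra:", extra)
```

Here the wildcard sweeps in the "bus terminal" sentence, which the explicit list leaves out. Reading the extra hits on a sample like this is a cheap way to decide whether a wildcard is worth its added review burden.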

Remember, no matter how carefully you craft your search terms, keyword searching is imprecise. The only way to be sure your searches are sufficiently comprehensive is by testing your results.

Effective Boolean queries must account for synonymy, polysemy, and contextual meaning.

Synonyms

Boolean queries must include all the functional synonyms for a keyword. An online thesaurus is essential in preparing a keyword Boolean document search.

Polysemy

Because words often have multiple meanings, keyword search has poor precision. Highly relevant keywords may return large numbers of irrelevant items, which require complex Boolean restrictions to eliminate the false positives.

This problem is exacerbated by the need to include functional synonyms. Effective Boolean queries are often long and difficult to interpret, which makes proofreading and error checking, much less bug fixing, difficult and time consuming. This is particularly true when the keyword can be used as a verb, a noun, or even an adjective, depending on its context.

Contextual meaning

Boolean keyword search does not capture contextual meaning. The meaning of a “smoking gun” email is rarely found in the actual words of the message; it comes from the context of those words or of the message itself. This is particularly true when wrongdoing of some kind is involved. Contextual analysis requires linking between and among senders and recipients, and the date and time sent and received, as well as the text and context of other documents.

To ensure that all possible relevant documents are identified, it is often necessary to include all documents that fall within important date ranges, that were sent to or from certain email addresses, or that contain relatively general terms, and then to search for and remove batches of material which can be easily identified as irrelevant, even though eliminating irrelevant data requires a priori knowledge of the documents.

Because large document data sets often have “many-to-many” rather than “one-to-many” links, and because linking expands in complexity exponentially, hyperbolic database management technologies are required in order to evaluate the mass of documents recovered from a comprehensive Boolean search and rank their relevancy.

Email communications routinely use informal and ad hoc abbreviations, alternate phrasing, colloquialisms, referential language and omissions that rely on context for their meaning, and Boolean searches often miss these words and phrases entirely.

Keyword search is a vital tool. It is extremely effective if you know exactly what you are looking for. Every practitioner involved in electronic document discovery who is considering some kind of keyword Boolean search must at the very minimum speak to or depose the custodians of all the potentially relevant documents and from their statements or testimony build a glossary of terms, abbreviations and other identifiers associated with any of the documents in that database. (William A. Gross Const. Associates, Inc. v. Am. Mfrs. Mut. Ins. Co., 256 F.R.D. 134, 134 (S.D.N.Y. 2009) (Peck, M.J.))

Then test the queries you do develop to ensure they properly capture the relevant documents. (Moore v. Publicis Groupe, 287 F.R.D. 182 (S.D.N.Y. 2012) (Peck, M.J.)) Many if not most e-discovery protocols are built around reaching agreement on keywords, but few of those protocols require testing to see whether the keywords might be missing large numbers of relevant documents.

In re Direct Sw., Inc., Fair Labor Standards Act (FLSA) Litig., 2009 WL 2461716, **1–2 (E.D. La. Aug. 7, 2009) cited Magistrate Judge Peck in William A. Gross for his admonition that his decision “should serve as a wake-up call to the Bar in this District about the need for careful thought, quality control, testing, and cooperation with opposing counsel in designing search terms or ‘keywords’ to be used to produce emails or other electronically stored information….” and for his description of the case as “[T]he latest example of lawyers designing keyword searches in the dark, by the seat of the pants, without adequate (indeed, here, apparently without any) discussion with those who wrote the emails.”

There are classification tools, such as email threading and near-duplicate detection, which can help and should be added to the e-discovery toolbox of every litigator.
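For readers curious about what near-duplicate detection involves, the Python sketch below shows one simple, well-known technique: comparing overlapping three-word "shingles" with Jaccard similarity. It is an illustration under stated assumptions, not a description of how any commercial tool works; the function names, the 0.5 threshold, and the sample messages are invented.

```python
import re


def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text, lower-cased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles in either set."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5):
    """Return (id, id, score) for each pair scoring at or above the threshold."""
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id]) for doc_id in ids}
    pairs = []
    for x in range(len(ids)):
        for y in range(x + 1, len(ids)):
            score = jaccard(sets[ids[x]], sets[ids[y]])
            if score >= threshold:
                pairs.append((ids[x], ids[y], score))
    return pairs


# Invented messages: m1 and m2 differ by a single word.
docs = {
    "m1": "Please review the attached draft agreement before the Friday call",
    "m2": "Please review the attached draft agreement before the Monday call",
    "m3": "The quarterly numbers are in and they look fine",
}

pairs = near_duplicates(docs)
for first, second, score in pairs:
    print(f"{first} ~ {second}: similarity {score:.2f}")
```

Grouping messages like m1 and m2 lets a reviewer code them together instead of reading each one separately. This pairwise comparison is only practical for small sets; large collections generally need an indexing shortcut, which is outside the scope of this sketch.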

If the case involves what you as the attorney of record consider a large number of electronic documents and you are concerned about cost-effectively identifying the relevant material, consider speaking to an e-discovery attorney about the benefits of an early data assessment. Such an assessment can help you estimate the costs of e-discovery more accurately, find effective ways to eliminate large numbers of irrelevant documents without review, and fashion an effective and reasonable e-discovery plan proportional to the needs of the case.