Wednesday, May 8, 2013

Federal court approves pre-predictive coding keyword filtration based on faulty math in In re Biomet

A district court’s recent approval of keyword filtration prior to the use of predictive coding in In re Biomet, No. 3:12-MD-2391 (N.D. Ind. April 18, 2013) rests on bad math and could deprive the requesting party of over 80% of the relevant documents. Specifically, the court ruled that a defendant’s use of predictive coding on a keyword-culled dataset met its discovery obligations because only a “modest” number of documents would be excluded. But a proper analysis of the statistical sampling on which the court relied shows that defendant’s keyword filtration would deprive plaintiffs of a substantial proportion of the relevant documents. The error in the court’s finding regarding the completeness of defendant’s production underpinned and undermines its additional holding that to require the defendant to employ predictive coding on the full dataset would offend Rule 26(b)(2)(C) proportionality. Accordingly, the early chorus of praise which has greeted the decision is unwarranted.

Background and holding

The defendant faced numerous products liability lawsuits that were subject to MDL consolidation. Plaintiffs cautioned defendant not to proceed with document production until the cases were consolidated, but defendant proceeded anyway. Thus, prior to consolidation, defendant collected 19.5 million documents, then reduced that collection to 3.9 million documents via keyword searching (2.5 million after deduplication). Defendant then employed predictive coding on this keyword-hit population to make its production. Defendant excluded the non-keyword-hit population of 15.6 million documents from its review and production.

The overall population of 19.5 million documents had a prevalence (percentage of responsive documents) of 1.37% to 2.47%. The non-keyword-hit population of 15.6 million documents had a prevalence of 0.55% to 1.33%. (All prevalence ranges at the 99% confidence level).

The MDL plaintiffs’ steering committee challenged the sufficiency of the defendant's production, arguing that the keywords were insufficient and untested, and that a much larger volume of documents should have been produced. The steering committee sought to have defendant apply predictive coding to the entire dataset of 19.5 million documents, and to participate collaboratively in the predictive coding process.

The MDL judge, Robert Miller, denied the challenge. He acknowledged that predictively coding the entire dataset might yield additional responsive documents but, in sole reliance on the figures above, concluded that only "a comparatively modest number of documents would be found." He also found that it would cost defendant "a million, or millions, of dollars to test [plaintiffs'] theory that predictive coding would produce a significantly greater number of relevant documents." On this basis, the court ruled that the additional discovery sought failed the proportionality test of Rule 26(b)(2)(C), and declined to order it unless plaintiffs bore the expense. The court also held open the possibility that additional keyword-search terms, and production of non-privileged documents included in defendant's sampling, could be negotiated through meet-and-confer.

The court’s holding excludes a potentially sizable proportion of relevant documents

The court grounded its decision on the conclusion that in the non-keyword-hit population "only a modest number of documents would be found." But that conclusion is belied by the figures relied on by the court. The relevant documents in the non-keyword-hit set comprise anywhere from approximately 20% to 80% of the relevant documents in defendant's possession. In other words, the court's ruling could effectively exclude nearly 80% of the relevant documents, and caps recall at a theoretical maximum of just over 80%.

The vital statistics are as follows:
  • 19.5 million document population, with prevalence of 1.37% to 2.47%, for a range of 267,150 to 481,650 relevant documents.
  • Culled to 3.9 million documents by keyword-search (culled to 2.5M unique documents by deduping, a point tangential to the analysis here).
  • Thereby omitting 15.6 million non-keyword-hit documents, with prevalence of 0.55% to 1.33%, or a range of 85,800 to 207,480 relevant documents.
These figures readily indicate best- and worst-case recall. (Quick refresher: recall, the most important measure of accuracy, means the percentage of relevant documents that are actually produced. Precision, the other significant accuracy measure, means the percentage of documents produced that are actually relevant. Precision is not important to the analysis here).
  • Best-case recall: Assume the largest number of total relevant documents, and the smallest number of relevant documents was omitted.
    • Largest number of relevant documents = 2.47% * 19.5M = 481,650
    • Smallest number of omitted relevant documents = 0.55% * 15.6M = 85,800
    • Number of produced relevant documents = 481,650 - 85,800 = 395,850
    • Percentage of relevant documents produced = 395,850 / 481,650 = 82% recall
  • Worst-case recall: Assume the smallest number of total relevant documents, and the largest number of relevant documents was omitted:
    • Smallest number of relevant documents = 1.37% * 19.5M = 267,150
    • Largest number of omitted relevant documents = 1.33% * 15.6M = 207,480
    • Number of produced relevant documents = 267,150 - 207,480 = 59,670
    • Percentage of relevant documents produced = 59,670 / 267,150 = 22% recall
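For readers who want to check the arithmetic, the best- and worst-case figures can be reproduced in a few lines of Python, using only the document counts and prevalence ranges reported in the opinion (the 3.9 million keyword-hit count is not needed to compute recall):

```python
# Best- and worst-case recall implied by the court's own prevalence figures,
# assuming (for now) perfect recall within the keyword-hit set.
TOTAL_DOCS = 19_500_000    # full collection
NON_HIT_DOCS = 15_600_000  # excluded by keyword culling

def recall(total_prev, omitted_prev):
    """Recall = relevant documents produced / total relevant documents."""
    total_relevant = total_prev * TOTAL_DOCS
    omitted_relevant = omitted_prev * NON_HIT_DOCS
    return (total_relevant - omitted_relevant) / total_relevant

best = recall(0.0247, 0.0055)   # most relevant overall, fewest omitted
worst = recall(0.0137, 0.0133)  # fewest relevant overall, most omitted

print(f"best-case recall:  {best:.0%}")   # 82%
print(f"worst-case recall: {worst:.0%}")  # 22%
```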
Seen another way, in the worst-case scenario, the defendant's production contains fewer relevant documents than would a random selection of one-fourth of the document population (which would yield 25% recall). That result is in line with the renowned Blair and Maron study, which found that keyword search yields recall of approximately 20%.

The truth lies somewhere between those two points. Taking the midpoint, or 52% recall, would imply that defendant produced only slightly more than half of the relevant documents. It is difficult to reconcile this result with the court's conclusion that the number of documents omitted from defendant's production is insubstantial.

Bad to worse

Even those dismal recall figures overstate the proportion of relevant documents produced, and place the court's finding still more at odds with the numbers on which it relies. The foregoing figures assume perfect recall amongst the 3.9 million keyword-hit documents on which defendant performed predictive coding. That assumption is implausible. The court, apparently relying on plaintiffs' argument in favor of predictive coding, hypothesized recall of 75% amongst the keyword-hit documents. (The court also noted that plaintiffs, without evidence, posited recall as high as 95% in advocating its use on the entire document population.)

Based on the court's 75% predictive-coding recall figure, the recall estimates must be adjusted downward:
  • Best-case: 82% * 75% = 61.5% recall
  • Worst-case: 22% * 75% = 16.5% recall
  • Midpoint: 39% recall
Factoring in the court's posited 75% recall figure, even the best-case scenario omits almost 40% of the relevant documents. The worst-case scenario omits almost 85%. Even these figures could overstate recall when one factors in margin of error, as discussed below.
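The adjustment is simply the product of the two stages' recall rates: keyword culling first, then predictive coding at the court's hypothesized 75% recall. A quick check:

```python
# Overall recall is the product of the recall of each stage.
PC_RECALL = 0.75  # the court's hypothesized predictive-coding recall

best_overall = 0.82 * PC_RECALL   # 61.5%
worst_overall = 0.22 * PC_RECALL  # 16.5%
midpoint = (best_overall + worst_overall) / 2  # 39%

print(f"best-case overall recall:  {best_overall:.1%}")
print(f"worst-case overall recall: {worst_overall:.1%}")
print(f"midpoint:                  {midpoint:.0%}")
```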

In light of the recall levels readily attainable, and measurable, using predictive coding, these figures seem unnecessarily low and potentially perilous to justice. Deprivation of so large a proportion of the relevant documents could prevent a party from making its case.

The pitfalls of error-rate analysis

As the foregoing analysis shows, the court's conclusion clashes with the underlying figures on which it relies. The plaintiffs' brief focused on the deficiencies of defendant's keyword search, and of keyword search generally, not on the mathematical analysis above. The mathematical argument could have been made with greater cogency and emphasis, but the lawyering is not to blame.

The court may have been beguiled by the low prevalence of relevant documents in the non-keyword-hit population, eliding the substantial proportion of all relevant documents which they comprised. The mistake, which might be called the "low-prevalence illusion" and is akin to what statisticians call the "accuracy paradox," is not uncommon in the e-discovery community: relying on error rate as a measure of accuracy, rather than focusing on recall and precision, which better measure the quality of the production. The problem is that low prevalence makes it trivially easy to achieve a low error rate: just produce a small proportion of the documents. For example, imagine a litigant with an initial document population of 1 million documents with 5% prevalence--that is, 5%, or 50,000, of the documents in the collection are responsive. If the party produces only half, or 25,000, of the 50,000 responsive documents, it has achieved an error rate of only 2.5%, for 97.5% "accuracy." Even if the party produces no documents whatsoever, it has achieved a 5% error rate, for an impressive-sounding 95% "accuracy" rate. The party could produce nothing but 25,000 non-responsive documents and achieve a 7.5% error rate (92.5% "accuracy"), or 50,000 non-responsive documents for a 10% error rate (90% "accuracy").
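The hypothetical above can be expressed in a few lines of code; the point is that, at low prevalence, error rate rewards under-production:

```python
# Why a low error rate is a misleading accuracy measure at low prevalence:
# the hypothetical 1-million-document collection with 5% prevalence.
POPULATION = 1_000_000
RESPONSIVE = 50_000  # 5% prevalence

def error_rate(responsive_produced, nonresponsive_produced):
    """Errors = responsive documents missed + non-responsive documents produced."""
    missed = RESPONSIVE - responsive_produced
    return (missed + nonresponsive_produced) / POPULATION

print(error_rate(25_000, 0))       # 0.025 -> "97.5% accurate", yet recall is only 50%
print(error_rate(0, 0))            # 0.05  -> "95% accurate", recall 0%
print(error_rate(0, 25_000))       # 0.075 -> "92.5% accurate", recall 0%
print(error_rate(0, 50_000))       # 0.1   -> "90% accurate", recall 0%
```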

Similarly, in In re Biomet, by omitting all documents from the non-keyword-hit set, defendant was able to achieve a seemingly trivial error rate amongst those documents of between half a percent and one-and-a-third percent--or an "accuracy" rate in the range of 98-99%. Judge Miller apparently relied on these impressive-sounding figures, without considering that, at such low prevalence, even a small error rate can conceal the omission of a large share of the relevant documents.

The low-prevalence illusion highlights the importance of examining recall (and precision) rather than error rate in assessing the sufficiency of a production. We have heard from multiple firms that their predictive-coding vendors measure accuracy by error rate. As demonstrated above, such accuracy measurement can be highly misleading with potentially dire consequences for the quality of a production.


The court approved a production that was at best incomplete and at worst gravely defective, possibly through innumeracy. This conclusion about the sufficiency of defendant's production flows not from abstract reference to academic studies or a general aversion to keyword filtration, but from the facts sub judice. The court's holding is questionable in light of the flawed understanding of the potential magnitude of omitted relevant documents on which it rests.

Additional thoughts

A few additional points:
  • Recall. The court, like the one in Global Aerospace, Inc. v. Landow Aviation, L.P., No. CL 61040 (Loudoun County, Va. Circuit Court April 23, 2012), tolerated a recall goal of only 75%. Nor did the requesting party object on that score. Given the power of predictive coding, and the ease of measuring accuracy to within fairly reasonable bounds, the adequacy of productions with such low recall may be questionable even given proportionality strictures. Higher recall targets should be reasonably achievable. In the other case explicitly approving of predictive coding, Da Silva Moore v. Publicis Group SA, No. 11-cv-1279 (S.D.N.Y. February 22, 2012), Judge Peck approved an ESI protocol with no recall goal, leaving that to the parties to negotiate.
  • Dearth of precision analysis. The absence of any discussion whatsoever of precision in In re Biomet or the other predictive-coding decisions is striking. This critical concept is scarcely mentioned in the briefs, and not at all in the opinions. A requesting party who does not demand a given level of precision invites himself to be "snowed under" by a giant haystack which will itself require predictive coding to find the few and far-between needles (although this may sometimes be a desirable side-effect of a very high recall threshold). The precision level of the contested production in In re Biomet is not readily apparent from the briefs, but is theoretically at least 16% (the prevalence among the keyword-hit documents).
  • Review burden and predictive-coding precision. The In re Biomet defendant claimed that, to generate an adequate production, it expected to have to review 10% to 50% of the documents. Even at the high end of estimated prevalence--2.5%--defendant’s claim would imply a predictive-coding precision of as low as five percent. I cannot remember any case in which review of half that many documents was necessary. Considering the low prevalence, even ten percent seems a bit high, albeit not unreasonable.
  • Recall measurement and sample sizes. The court indicated that defendant calculated the prevalence at 1.37% to 2.47% in the overall dataset, and 0.55% to 1.33% in the non-keyword-hit population, at the 99% confidence level. This implies review of a random sample of a little over 4,000 documents from the non-keyword-hit population. (A sample size calculator may be helpful here). A random sample of 4,000 documents, at a prevalence level of roughly 1%, yields only about 40 responsive exemplars on which to base a recall estimate. Given an estimated recall within the keyword-hit documents of 75%, that would imply a confidence interval (margin of error) of +- 17% at the 99% confidence level. Recall within the keyword-hit population could therefore have been as low as 57% (75% - 17%). Returning to our worst-case scenario analysis (only 22% of the responsive documents in the keyword-hit population, for 16.5% overall recall), this means that overall recall could have been as low as roughly 12.5% (22% * 57%). At the 95% confidence level, the confidence interval would be +- 13%, for keyword-hit recall as low as 62%, and overall recall as low as 14%.
  • Proportionality. Proportionality requires consideration of two factors: benefit (to the requesting party, in terms of likely additional probative evidence) and burden (on the producing party, in terms of expense and effort). The analysis above indicates that the judge seriously underestimated the benefit to the requesting party of the additional discovery sought in this case. A detailed examination of the burden on the producing party exceeds the scope of this blog entry, as must any recommendation as to what the judge ultimately should have held.
  • Oddments. The court found that it would cost "a million, or millions, of dollars" to "test" the magnitude of the missing documents. Needless to say, the analysis above required no further sampling or review. The judge's remark that "Boolean language provides the basis for keyword searches, though I can't find anything in this record that equates today's keyword searches to Boolean searches" suggests general unfamiliarity with ESI retrieval.
  • Price. According to the defendant's brief, its vendor is charging some $680/gb (exclusive of ongoing monthly fees) for predictive coding and a panoply of ancillary services to include "processing," "extraction," "ingestion," and "enrichment." The defendant further claims that to perform predictive coding on the non-keyword-hit documents would incur an additional $2 million in vendor charges, including over $1 million for predictive coding alone. This expense would not include attorney review. Suffice it to say, Backstop believes that its prices compare favorably.
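To illustrate the margin-of-error arithmetic in the recall-measurement point above, here is a sketch using the standard normal approximation for a proportion. Note that the roughly 40-exemplar sample size is an inference from the opinion's figures, not something the opinion states:

```python
import math

# Normal-approximation margin of error for a recall estimate of 75%
# based on roughly 40 responsive exemplars (an assumed sample size
# inferred from the opinion's prevalence and confidence figures).
def recall_margin(p, n, z):
    """Margin of error for an estimated proportion p with n exemplars."""
    return z * math.sqrt(p * (1 - p) / n)

p, n = 0.75, 40
m99 = recall_margin(p, n, 2.576)  # z-score for 99% confidence
m95 = recall_margin(p, n, 1.960)  # z-score for 95% confidence

print(f"99% CI: {p:.0%} +/- {m99:.1%}")  # roughly the +/- 17% figure above
print(f"95% CI: {p:.0%} +/- {m95:.1%}")  # roughly the +/- 13% figure above
```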
