Wednesday, March 12, 2014

Dubious claims about predictive coding for "information governance"



  • Information governance ("IG") basically means defensible deletion.
  • A vendor’s claim to have achieved 90% precision with de minimis document review in an IG proof-of-concept omitted any mention of recall and is therefore suspect, for recall is the touchstone of defensibility.
  • The claim appears to understate by multiple orders of magnitude the number of documents that would have required review in order to verifiably achieve the results claimed.
  • In the IG world of low prevalence, low precision is not an important issue, and higher recall can be achieved at lower cost.
  • Mass culling should not be overlooked as a supplement to predictive coding.
  • Persistent analysis is a fecund field for investigation.


I recently attended a symposium on “information governance” at the University of Richmond Law School, sponsored by the Journal of Legal Technology. Kudos to Allison Rienecker and the JOLT team for a well-run event.

At the symposium, a well-known predictive-coding vendor made some interesting and, I daresay, misleading claims about an IG “proof of concept” (POC) which purportedly would have enabled a corporation safely to discard millions of documents after review of only about 1,200, despite a prevalence of just four-hundredths of a percent. A screen-capture summary of the POC and the vendor's key claims, and a full video of the presentation, are below. The main discussion of the POC begins at around the 2:36:20 mark of the video and lasts for about five minutes.





The presenter boasted of impressive-sounding 90% precision, but said nothing of recall, nor can I fathom how the vendor could have determined recall under the circumstances. Law firms and corporate clients should beware of this claim and of any claim that does not address recall. IG has the potential to be cost-effective, including in the dataset discussed by the vendor.  But the vendor appears to have understated by multiple orders of magnitude the number of documents that would have required review in order to verify the results claimed.


"IG" is this year’s e-discovery buzzword. Basically, it means defensible deletion—identifying which electronic documents an entity (typically a large business) must preserve because of regulations, litigation holds, probable litigation, or otherwise, and which it may discard without fear of consequence. Such segregation has risen in importance with the proliferation of electronic business data, the increasing volume of is expensive to store. One symposium speaker claimed, for example, that a petabyte (one million gigabytes) costs $5 million per year to store. In e-discovery jargon IG typically means that the segregation is performed using predictive coding or other technology.

The vendor described its POC as follows. The client, a foreign bank acting through a law firm, had a population of four million emails it wanted to get rid of. Documents that had to be preserved (which I will call “Preservable Documents”) fell into two categories, with a combined prevalence of four-hundredths of a percent (0.04%, or 0.0004 expressed as a proportion). The client set a goal of 90% precision. There were 45 Preservable Documents and 48 non-Preservable Documents already classified as such, presumably from a keyword search or a request to a custodian. The vendor identified 520 additional documents for review, at which point its model stabilized. The client then reviewed 567 documents for “QA,” for a maximum total of 1,180 documents reviewed. (It was unclear whether the QA documents overlapped with the 520 training documents.)

Based on that review, the vendor identified 1,533 Preservable Documents with 90% precision, enabling the client to discard the rest (some four million), yielding untold savings in storage costs, especially if scaled across the enterprise.

The main objective of IG, recall, was absent from the vendor’s discussion of this POC, and based on the claimed prevalence and number of documents reviewed, it is hard to see how the vendor could have determined it. A recall measurement generally requires a random sample that includes a minimum number of positive observations—in this case, documents that require preservation. Suppose that the client set a target of 85% recall. To attain a 95% confidence level that an 85% recall measurement is accurate to within +- 10% (meaning recall could actually be as high as 95% or as low as 75%) requires a random sample with 49 positive observations. (With a statistical calculator you can experiment with other figures). To get 49 positive observations at the claimed prevalence of 0.04% would require review of more than 120,000 random documents (49 / 0.0004 = 122,500).
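
For readers who want to experiment with these figures, here is a minimal Python sketch of the normal-approximation arithmetic (the function names are my own; other interval methods will yield slightly different numbers):

    import math

    Z_95 = 1.96  # z-score for a 95% confidence level

    def positives_needed(recall, margin, z=Z_95):
        """Positive (relevant) observations needed to measure recall
        to within +/- margin, using the normal approximation."""
        return math.ceil(z**2 * recall * (1 - recall) / margin**2)

    def random_docs_to_review(recall, margin, prevalence, z=Z_95):
        """Random documents to review to expect that many positive observations."""
        return positives_needed(recall, margin, z) / prevalence

    # The example from the text: 85% recall measured to +/- 10%
    # at the claimed prevalence of 0.04%.
    print(positives_needed(0.85, 0.10))                      # 49
    print(round(random_docs_to_review(0.85, 0.10, 0.0004)))  # 122500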

Without knowing recall, there is no way to assess the quality of the preservation. The vendor found 1,533 Preservable Documents. That may be good if there were only, say, 1,700 Preservable Documents—recall of 90%. But what if there were 4,500 Preservable Documents (~33% recall), or 15,000 (10%), or 150,000 (1%)?

Perhaps the vendor can point to the asserted prevalence of 0.04%, which would imply roughly 1,600 Preservable Documents in a population of four million, and thus good recall by the vendor. But that raises the question: how did the vendor determine prevalence? The vendor claimed to have reviewed, at most, 1,180 documents. Even if all of those documents were selected at random (the vendor did not specify selection criteria), at the observed prevalence of 0.04% this would yield a confidence interval of +- 0.077% (at the 95% confidence level). This means that prevalence could have been roughly three times as high (0.0004 + 0.00077 = 0.00117 ≈ 3 * 0.0004), implying some 4,680 Preservable Documents, which would cap recall at about 33% (1,533 found out of 4,680), an indefensible level.
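
To make that cap concrete, here is a small illustrative calculation using the figures above (the population of four million and the stated margin of error come from the text; the margin itself will vary with the interval method used):

    # Recall cap implied by the upper bound of the prevalence estimate
    # (figures taken from the text; illustrative only).
    population = 4_000_000
    found = 1_533                        # Preservable Documents the vendor identified
    upper_prevalence = 0.0004 + 0.00077  # point estimate plus claimed margin of error

    implied_preservable = upper_prevalence * population  # ~4,680 Preservable Documents
    recall_cap = found / implied_preservable             # ~0.33

    print(round(implied_preservable), round(recall_cap, 2))  # 4680 0.33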

The claimed prevalence seems strikingly close to that implied by the 1,533 Preservable Documents identified by the vendor, which raises the possibility that the vendor bootstrapped the prevalence calculation based on the documents it found rather than conducting any statistical validation.

It is possible that the client already knew prevalence and told the vendor, in which case this POC vastly understates the need for document review, because such information would not, of course, be available in a real-world case. To calculate prevalence with the specificity required for, say, +-10% confidence in a recall measurement of 85% would require (pointless) review of a random sample of some 145,000 to 150,000 documents. (The math is as follows: 85% recall implies that 1,303 of the 1,533 Preservable Documents are found and 230 are omitted. Thus, our target proportion of Preservable Documents in the unreviewed set is approximately 0.0000575 (230 Preservable Documents / 3,998,820 unreviewed documents). For a +-10% confidence interval, we must consider 95% recall (1,456 of 1,533 Preservable Documents found, 77 omitted, a proportion of 0.0000193) and 75% recall (1,150 found, 383 omitted, a proportion of 0.0000958). Thus, we must identify the sample size necessary to measure a proportion of 0.0000575 plus or minus 0.0000383, the difference between the target proportion and the upper and lower acceptable bounds (0.0000958 and 0.0000193, respectively). The requisite sample size is roughly 145,000 to 150,000, depending on the extent of rounding.)
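
The parenthetical arithmetic can be reproduced roughly as follows; this is a sketch using the normal approximation, and the result shifts a bit with rounding and the interval method chosen:

    import math

    Z_95 = 1.96                         # z-score for a 95% confidence level
    population = 4_000_000
    reviewed = 1_180
    unreviewed = population - reviewed  # 3,998,820
    total_preservable = 1_533

    def omitted_proportion(recall):
        """Share of the unreviewed set made up of Preservable Documents
        that were missed, if the given recall was achieved."""
        return total_preservable * (1 - recall) / unreviewed

    target = omitted_proportion(0.85)                # ~0.0000575
    low, high = omitted_proportion(0.95), omitted_proportion(0.75)
    margin = min(target - low, high - target)        # ~0.0000383

    sample = math.ceil(Z_95**2 * target * (1 - target) / margin**2)
    print(f"{sample:,}")  # ~150,000; the estimate in the text differs slightly due to rounding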

Without knowing recall, it is also impossible to evaluate whether the precision claim is as impressive as it doubtless sounds to many an audience. Ninety percent precision with 10% recall would not be defensible. The vendor could have achieved 100% precision simply by returning the 45 Preservable Documents already identified in the seed set. It bears noting that even a facially low precision can be very helpful if recall is high and prevalence small. For example, with prevalence of 1%, recall of 90% combined with precision of just 5% would entail elimination of over 80% of the data.

Towards the end of the presentation (2:44:48 - 2:45:26), another oddment emerged when the vendor said that it enabled the client to eliminate from preservation 45% of the dataset, or about 1.8 million documents. If the vendor found all or substantially all of the 1,533 Preservable Documents with 90% precision, why not eliminate all four million, minus the roughly 1,700 documents flagged for preservation? Even 1% precision would eliminate over 95% of the dataset.
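
The arithmetic behind both of these elimination figures is simple enough to sketch; the function below and its inputs are mine, applied to the figures in the text:

    def fraction_eliminated(prevalence, recall, precision):
        """Share of a collection that can be discarded if the documents
        flagged for preservation are kept and everything else is deleted.
        The flagged set, as a share of the population, is
        prevalence * recall / precision."""
        return 1 - prevalence * recall / precision

    # Example from the preceding paragraph: 1% prevalence, 90% recall, 5% precision.
    print(round(fraction_eliminated(0.01, 0.90, 0.05), 2))    # 0.82 -> over 80% eliminated

    # The POC population: ~0.04% prevalence; even at 1% precision (assuming
    # essentially complete recall), over 95% of the data could go.
    print(round(fraction_eliminated(0.0004, 1.00, 0.01), 2))  # 0.96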

None of this is to say that IG is unimportant or unachievable using predictive coding. As shown above, 85% recall in this case could verifiably have been achieved through review of about 122,500 documents, a number that does not increase with the overall volume of documents (because requisite sample sizes do not grow with population size). Ninety percent recall would have required review of even fewer documents. That review expense may be an order of magnitude or more below the cost of storing everything.

Presented with a real-world case involving the circumstances in the vendor’s POC, we would have recommended something along the lines of the following steps:
  1. Conduct random review. As the very low prevalence became readily apparent, we would advise the client to consider targeted culling as well. For example, it might be possible to identify and eliminate custodians and date ranges that could not conceivably contain Preservable Documents.

  2. Concurrently, seed the Backstop model with known training exemplars, which should be abundant and easy to identify.

  3. Concurrently with steps 1 and 2, choose a target recall of 90% or 95%, which in turn will determine the number of positive observations required. Usually this step would be independent of steps 1 and 2 and would be guided by achievable precision at various recall levels. In this case, however, minuscule prevalence would soon become evident, shifting the dominant consideration to sample size. A recall measurement of 90% +- 10% would require 35 random positive observations, which at 0.0004 prevalence would require review of 87,500 documents. A recall of 95% +- 10% would require only 19 positive observations, or review of 47,500 documents. Here the savings in reviewed documents would likely dwarf the expected cost of storage occasioned by lower precision. For example, assume that
    • it costs $1 to attorney-review each document
    • it costs one-tenth of a cent ($0.001) to store a document for a year (based on assumptions of $5M / petabyte per another presenter, 5,000 docs per gigabyte)
    • 100% precision is available at 85% recall
    • Precision falls to 10% at 90% recall
    • Only 1% precision is available at 95% recall 

    On these assumptions, we can compare the cost of different recall and precision levels:

                                          95% recall  90% recall  85% recall
    Exemplars needed                              19          35          49
    Random docs to review for exemplars       50,667      93,333     130,667
    Cost of random review                    $50,667     $93,333    $130,667
    Precision                                     1%         10%        100%
    Preservable Documents stored               1,456       1,303       1,150
    non-Preservable Documents stored         145,600      13,030           0
    Total docs identified and stored         147,056      14,333       1,150
    Storage cost                             $147.06      $14.33       $1.15
    Total cost                            $50,814.06  $93,347.33 $130,668.15

    Not only is the additional storage cost negligible compared to the additional review needed, but the most defensible recall level (the highest) is also the most economical.

  4. Once enough random review is complete to attain the desired recall measurement identified in step (3), run the model and identify documents suitable for deletion.

In light of the foregoing discussion, a few key points are apparent about the IG realm. Because prevalence is ultra-low, review costs dominate storage costs: many documents must be reviewed in order to obtain adequate sample sizes. It follows that low precision is not a significant obstacle, because precision affects only storage costs. And because a higher recall measurement requires fewer positive observations, and because higher recall is more defensible, it is possible to obtain more defensible results at lower cost.

The interplay among recall, precision, prevalence, review cost, and storage cost in the POC can be seen in the graphs below. Two points stand out. First, cost declines as recall increases, because fewer relevant exemplars need be found at higher recall levels to attain a confidence interval of +-10%. Second, storage costs, even at 1% precision, are negligible both in absolute terms and as a proportion of total cost, so much so that they can scarcely be seen on some of the graphs.
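
For those who want to tinker with the trade-off, here is a simplified sketch of the cost model behind the table above. The parameterization is my own: it uses the claimed 0.04% prevalence and treats the stored set as the found Preservable Documents divided by precision, so its outputs differ modestly from the table's figures, but the pattern (higher recall is cheaper overall) is the same.

    import math

    Z_95 = 1.96            # z-score for a 95% confidence level
    REVIEW_COST = 1.00     # $ per document of attorney review (assumed)
    STORAGE_COST = 0.001   # $ per document per year ($5M per petabyte, 5,000 docs/GB)
    PRESERVABLE = 1_533    # Preservable Documents in the collection
    PREVALENCE = 0.0004    # claimed prevalence

    def total_cost(recall, precision, margin=0.10):
        """Review cost to verify the recall measurement plus a year of storage
        for everything flagged for preservation."""
        exemplars = math.ceil(Z_95**2 * recall * (1 - recall) / margin**2)
        review_docs = exemplars / PREVALENCE          # random docs needed to find the exemplars
        stored_docs = PRESERVABLE * recall / precision
        return review_docs * REVIEW_COST + stored_docs * STORAGE_COST

    for recall, precision in [(0.95, 0.01), (0.90, 0.10), (0.85, 1.00)]:
        print(f"{recall:.0%} recall, {precision:.0%} precision: ${total_cost(recall, precision):,.2f}")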






A couple of additional thoughts on IG. First, IG holds considerable promise to nip problems in the bud through persistent analysis. Suppose that a company has identified previous FCPA or antitrust misconduct, for example. It will have identified a store of related documents. It might be helpful to build a predictive-coding model from those documents and, on a regular basis, run it against all new company documents, marking for review by internal counsel those most indicative of potential problems. Second, different subcategories of Preservable Documents may need to be analyzed independently to achieve a given recall level with respect to each, as opposed to a single, overarching recall level. For example, imagine a company with two categories of Preservable Documents: category A with 90,000 documents and category B with 10,000. You could achieve 90% recall of all Preservable Documents by finding 89,000 of the category A documents and only 1,000 of the category B documents. Recall for category A would be almost 99%, but recall for category B would be only 10%.
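
The per-category point is easy to verify in a few lines, using the hypothetical category A and B figures from the example:

    def recall(found, total):
        return found / total

    # Hypothetical example from the text: 90,000 category A and 10,000 category B
    # Preservable Documents; the review finds 89,000 of A and 1,000 of B.
    found = {"A": 89_000, "B": 1_000}
    total = {"A": 90_000, "B": 10_000}

    print(f"overall recall: {recall(sum(found.values()), sum(total.values())):.0%}")  # 90%
    for category in total:
        print(f"category {category} recall: {recall(found[category], total[category]):.1%}")
        # category A: 98.9%, category B: 10.0%

In other words, an impressive overall recall figure can mask indefensibly low recall for a small but important subcategory.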
