Tuesday, April 22, 2014

Information Governance Calculator

In a recent blog post, I discussed the interactions among recall, precision, prevalence, storage costs, and review costs in the context of information governance.

Backstop has now created an experimental Information Governance Calculator to help illustrate these interactions and to calculate the savings made possible through IG in conjunction with predictive coding. The calculator and associated graphs may be accessed at this URL:

http://www.backstopllp.com/igcalc/igcalc.html

Points of interest that may be gleaned from the calculator include:

  • The higher the precision, the lower the storage costs and the greater the storage savings, all other things being equal; conversely, the lower the precision, the higher the storage costs and the smaller the savings.
  • The available storage savings correlate directly with the document population size (unsurprisingly).
  • Higher prevalence leads to higher storage costs at any given accuracy level (viz., at any given combination of recall and precision).  The same is true as document population increases.
  • Review cost (to find random exemplars) depends entirely upon and correlates directly with prevalence, and is independent of document population size.
  • Low precision does not affect review costs, unlike in the context of review-to-produce.
  • Total cost often declines as recall increases because of the decreased random review burden.
  • In general, it should not be difficult to create a predictive-coding model in the IG context, because (non-random) relevant exemplars will be abundant and easily identified. Where relevant exemplars do prove hard to find, additional review costs will follow.

The calculator currently shows storage savings only. It may be updated in the future to calculate total cost savings and ROI by incorporating the cost of predictive coding. For the IG exercise to be worthwhile, the per-document cost of predictive coding must be lower than the per-document cost of storage. Given how little storage costs relative to common predictive-coding offerings, and how much easier it is to store documents than to apply predictive coding to them, the undertaking may prove viable principally for exceptionally large document populations generating enough revenue to drive the per-document cost far below (by at least an order of magnitude) what is typical in the review-to-produce context.
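
For those curious about the arithmetic behind the calculator, here is a minimal sketch in Python. The function, parameter names, and example figures are illustrative assumptions of mine, not Backstop's actual code.

    # A minimal sketch (illustrative assumptions, not Backstop's actual code)
    # of the arithmetic behind an IG calculator like the one linked above.
    def ig_calculator(population, prevalence, recall, precision,
                      storage_cost_per_doc, exemplars_needed):
        relevant = population * prevalence
        true_positives = recall * relevant
        # Precision = TP / (TP + FP), so documents retained = TP / precision.
        retained = true_positives / precision
        storage_savings = (population - retained) * storage_cost_per_doc
        # Expected random review needed to find exemplars depends only on
        # prevalence, not on population size.
        random_review_docs = exemplars_needed / prevalence
        return retained, storage_savings, random_review_docs

    retained, savings, review_docs = ig_calculator(
        population=10_000_000, prevalence=0.05, recall=0.80, precision=0.70,
        storage_cost_per_doc=0.001, exemplars_needed=100)
    print(f"Retained: {retained:,.0f} docs; storage savings: ${savings:,.2f}; "
          f"random review to find exemplars: {review_docs:,.0f} docs")

Raising precision shrinks the retained set and increases the storage savings; raising prevalence or population size grows it; and the random-review figure moves only with prevalence, consistent with the observations above.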

Friday, April 18, 2014

Heartbleed

The Heartbleed bug has attracted considerable attention, deservedly so. Particularly in light of the law firm security concerns recently highlighted in the New York Times, law firms and other consumers of e-discovery services indubitably worry about whether data stored with their vendors is vulnerable to this bug.

The Heartbleed bug is a flaw in the TLS heartbeat extension of the OpenSSL cryptographic software library; it allows attackers to read data held in the memory of systems running defective versions of the software. No part of the Backstop suite of offerings makes or has ever made use of the OpenSSL library. Backstop servers are therefore unaffected by the Heartbleed bug.

Clients and prospective clients with questions in this connection should feel free to contact our Director of Technology, Brian Merrell.

Wednesday, March 12, 2014

Dubious claims about predictive coding for "information governance"

  • Information governance ("IG") basically means defensible deletion.
  • A vendor’s claim to have achieved 90% precision with de minimis document review in an IG proof-of-concept omitted any mention of recall and is therefore suspect, for recall is the touchstone of defensibility.
  • The claim appears to understate by multiple orders of magnitude the number of documents that would have required review in order to verifiably achieve the results claimed.
  • In the IG world of low prevalence, low precision is not an important issue, and higher recall can be achieved at lower cost.
  • Mass culling should not be overlooked as a supplement to predictive coding.
  • Persistent analysis is a fecund field for investigation.


I recently attended a symposium on “information governance” at the University of Richmond Law School, sponsored by the Journal of Law and Technology. Kudos to Allison Rienecker and the JOLT team for a well-run event.

At the symposium, a well-known predictive-coding vendor made some interesting and, I daresay, misleading claims about an IG “proof of concept” that purportedly would have enabled a corporation safely to discard millions of documents after reviewing only about 1,800, despite a prevalence of just four-hundredths of a percent (0.04%). A screen-capture summary of the POC and the vendor's key claims, and a full video of the presentation, appear below. The main discussion of the POC begins at around the 2:36:20 mark of the video and lasts about five minutes.

[Screen-capture summary and presentation video embedded in the original post]
The presenter boasted of an impressive-sounding 90% precision but said nothing of recall, nor can I fathom how the vendor could have determined recall under the circumstances. Law firms and corporate clients should beware of this claim and of any claim that does not address recall. IG has the potential to be cost-effective, including for the dataset discussed by the vendor. But the vendor appears to have understated, by multiple orders of magnitude, the number of documents that would have required review to verify the results claimed.
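
To see why, consider the arithmetic of verification by random sampling. Estimating recall requires a random sample containing a meaningful number of relevant documents, and at a prevalence of four-hundredths of a percent such documents are vanishingly rare. A rough illustration in Python, using standard sample-size arithmetic with confidence parameters that are my assumptions rather than the vendor's figures:

    # How many documents must be randomly reviewed to verify recall at very
    # low prevalence? The confidence parameters are illustrative assumptions.
    z = 1.96        # 95% confidence level
    margin = 0.05   # +/- 5% margin of error on the estimate
    p_hat = 0.5     # worst-case proportion for sample-size purposes

    relevant_needed = (z ** 2) * p_hat * (1 - p_hat) / margin ** 2  # ~385

    prevalence = 0.0004  # four-hundredths of a percent
    docs_to_review = relevant_needed / prevalence

    print(f"Relevant exemplars needed in the sample: {relevant_needed:.0f}")
    print(f"Expected random review to find them: {docs_to_review:,.0f} docs")
    # Roughly 960,000 documents, versus the ~1,800 claimed -- a gap of
    # multiple orders of magnitude.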

Friday, November 22, 2013

Comment on early EDI - Oracle Study results

The Electronic Discovery Institute yesterday released, via Law Technology News, some preliminary results of its study of a dataset provided by Oracle relating to its acquisition of Sun Microsystems. The study involved multiple providers of technology-assisted review, including Backstop, which categorized documents for three tags: responsive, privilege, and hot.

EDI's release is the first step toward what should eventuate in ground-breaking raw data and analysis. While skeletal (we have only ordinal F1 rankings thus far), it affords the basis for some thoughts, including very imperfect cost-adjusted performance measures. Interestingly, the results show no correlation between cost and accuracy ranking.

Backstop and the other study participants are forbidden to identify their own entries, and EDI can only tell a vendor which results are the vendor's own. So, while we are very pleased with our results, we cannot identify them or those of any other participant. In this post I will share some thoughts on difficulties with F1 as a benchmark for accuracy, then delve into a first attempt at a cost-adjusted performance spreadsheet, which you can sort, view, edit, and download.
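
For reference, F1 is the harmonic mean of recall and precision: it is dominated by the lower of the two values and weights them equally, whether or not they matter equally in a given context. A quick illustration (my own sketch, not the study's scoring code):

    # F1 is the harmonic mean of precision and recall; it is dominated by
    # the lower of the two values.
    def f1(precision, recall):
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Two systems with the same arithmetic-mean accuracy score differently:
    print(f1(0.90, 0.30))  # 0.45 -- high precision cannot rescue low recall
    print(f1(0.60, 0.60))  # 0.60 -- the balanced system scores higher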

Wednesday, November 20, 2013

EDI - Oracle Study Preliminary Results Released

Preliminary results have been released from the Electronic Discovery Institute - Oracle study, in which Backstop participated.  See an article at Law Technology News and the chart below.  We are very pleased to see these results and will soon share a preliminary analysis on this blog.  We also look forward to seeing more granular detail (viz., recall and precision figures) in the near future.


Thursday, May 23, 2013

Podcast on the In re Biomet decision

A couple of weeks ago, I discussed here the dubious mathematics underlying the court's approval of pre-predictive coding keyword searches in In re Biomet.  This morning I discussed the case with other e-discovery professionals on an ESI Bytes podcast.

Wednesday, May 8, 2013

Federal court approves pre-predictive coding keyword filtration based on faulty math in In re Biomet

A district court’s recent approval of keyword filtration prior to the use of predictive coding in In re Biomet, No. 3:12-MD-2391 (N.D. Ind. April 18, 2013), rests on bad math and could deprive the requesting party of over 80% of the relevant documents. Specifically, the court ruled that a defendant’s use of predictive coding on a keyword-culled dataset met its discovery obligations because only a “modest” number of documents would be excluded. But a proper analysis of the statistical sampling on which the court relied shows that the defendant’s keyword filtration would deprive plaintiffs of a substantial proportion of the relevant documents. The error in the court’s finding regarding the completeness of the defendant’s production underpinned, and undermines, its additional holding that requiring the defendant to employ predictive coding on the full dataset would offend Rule 26(b)(2)(C) proportionality. Accordingly, the early chorus of praise that has greeted the decision is unwarranted.
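
The core arithmetic is simple: what matters is not how low prevalence is within the discarded set considered in isolation, but what share of all the relevant documents the discarded set contains. A sketch in Python with purely illustrative numbers (not the actual Biomet figures):

    # Illustrative numbers only (not the actual Biomet figures): a low
    # prevalence in a large discarded set can still hide most of the
    # relevant documents.
    kept, kept_prevalence = 2_500_000, 0.08              # survived the keywords
    discarded, discarded_prevalence = 17_000_000, 0.05   # culled by the keywords

    relevant_kept = kept * kept_prevalence                  # 200,000
    relevant_discarded = discarded * discarded_prevalence   # 850,000

    share_lost = relevant_discarded / (relevant_kept + relevant_discarded)
    print(f"Share of relevant documents excluded by culling: {share_lost:.0%}")
    # ~81% -- hardly the exclusion of a "modest" number of documents.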