Predictive coding, computer-assisted coding, and technology-assisted review all refer to the use of software that applies machine learning algorithms to learn, from records presented to it (usually by human attorneys), what types of content are potentially relevant to a given legal matter. After the attorneys have provided a sufficient number of examples, the technology is given access to the entire potential corpus (records/data) to sort through and find records that, based on its “learning”, are potentially relevant to the case.
This automation can dramatically reduce costs because computers, instead of attorneys, conduct the first-pass culling of potentially millions of records.
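The train-then-cull workflow described above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming a very simple naive Bayes word-count model; commercial review tools use their own, more sophisticated methods. The function names (`train`, `relevance_score`, `rank_corpus`) and the sample documents are hypothetical and invented purely for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a document into lowercase word tokens (deliberately simplistic)."""
    return text.lower().split()


def train(labeled_docs):
    """Build word counts from (text, is_relevant) pairs coded by reviewers."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in labeled_docs:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def relevance_score(model, text):
    """Return the log-odds that a document is relevant (positive = more likely)."""
    word_counts, doc_counts, vocab = model
    total_docs = doc_counts[True] + doc_counts[False]
    log_prob = {}
    for label in (True, False):
        # Class prior with add-one smoothing.
        lp = math.log((doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            # Word likelihood with add-one (Laplace) smoothing.
            lp += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_prob[label] = lp
    return log_prob[True] - log_prob[False]


def rank_corpus(model, corpus):
    """First-pass cull: order the corpus from most to least likely relevant."""
    return sorted(corpus, key=lambda doc: relevance_score(model, doc), reverse=True)


# Hypothetical seed set coded by attorneys (True = relevant to the matter).
seed_set = [
    ("quarterly bonus pay gap complaint", True),
    ("promotion denied pay discrimination complaint", True),
    ("lunch menu for friday", False),
    ("office printer is out of toner", False),
]
# Hypothetical uncoded corpus to be culled.
corpus = [
    "friday lunch order",
    "complaint about pay and promotion",
    "printer toner delivery",
]

model = train(seed_set)
ranked = rank_corpus(model, corpus)
for doc in ranked:
    print(f"{relevance_score(model, doc):+.2f}  {doc}")
```

In this sketch the attorneys' coding decisions are the only source of "learning", so the quality and coverage of the seed set directly determine which records surface in the first pass.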
Predictive coding has several very predictable dependencies that must be addressed before it can be accepted as a useful and dependable tool in the eDiscovery process. First, which documents/records are used to “train the system”, and who chooses them? This training selection will almost always be conducted by attorneys involved with the case.
The second dependency revolves around the number of documents used for the training. How many training documents are needed to provide a sample large enough for a dependable process?
And most importantly, do the parties have access to all potentially relevant documents in the case from which to draw the training documents? Remember, potentially relevant documents can be stored anywhere. For predictive coding, or any other eDiscovery process, to be legally defensible, all existing case-related documents need to be available. This requirement highlights the need for effective information management by everyone in a given organization.
As the courts adopt, or at least experiment with, predictive coding, as Judge Peck did in Monique Da Silva Moore, et al., v. Publicis Groupe & MSL Group, Civ. No. 11-1279 (ALC)(AJP) (S.D.N.Y. February 24, 2012), an effective information management program will become key to the courts adopting this new technology.
One thought on “Information Governance and Predictive Coding”
We must also remember that the majority of electronic discovery tools are based upon Bayesian statistical models. Simply changing one variable can produce false positives, which then have to be rescanned, wasting time and money.
I believe John Martin from Beyond Recognition has a tool and methodology that significantly reduces scan time while also increasing probability to 5 9s.