The word about predictive coding has gotten out to practitioners involved in document-intensive business litigation and government investigations. At this point, even practitioners who shudder at the mention of electronic discovery have at least a passing familiarity with the concept. Many people now know that predictive coding allows attorneys to code a smaller set of documents, use those coding decisions to train software to identify relevant documents, and then apply that logic to a larger set of documents in a statistically valid way. We are now at a point where clients, judges, and even certain government agencies are specifically requesting that predictive coding be used (or at least considered). As such, it is helpful to highlight certain key questions and decision points that need to be addressed when considering a predictive coding process. While there are a number of considerations to take into account, the following are what I have found to be the five key questions and decision points.
1. Will predictive coding work with my dataset?
The basic rule here is: the bigger the dataset, the better suited it is for predictive coding rather than running keywords and performing a straight review. I have heard claims from certain analysts that predictive coding can be effective and efficient in datasets as small as 50,000 documents. (But any specific number is really just a rule of thumb.) The other major consideration is the “richness” of the collected data in light of the scope of the issues in the case. For example, assume that a small finance company is sued on a sexual harassment claim relating to a discrete, one-time incident. Assume also that the company has 200,000 documents of all varieties intermixed on a poorly organized shared network drive. If you know going in that 99% of those documents relate purely to economic models and have nothing to do with any communications that would reasonably be relevant to the issues in the case, it likely does not make sense to go through the training necessary to teach the predictive coding software what is relevant. What would likely make more sense is to devise a relatively narrow and targeted set of keywords to isolate the potentially relevant documents, and then review each of those documents in a straight review. In an instance like this, the data would not be “rich”: the collection is heavily skewed toward a single topic (economic modeling) that has very little relevance to the matter at hand.
2. Will predictive coding save my client time/money?
Not every case is appropriate for a workflow involving predictive coding. Based on my experience, the most striking feature of a predictive coding review is the need at the outset for clarity and consistency in both the workflow and the substantive position on relevance. Compared to a traditional linear review, the project management and planning demands of a predictive coding review increase significantly, and their importance is difficult to overstate. In many ways, this is the hardest part of predictive coding for both practitioners and clients to understand and fully appreciate. In a linear review, if keywords change, additional documents can be pulled into the review set without disrupting prior work. Likewise, if the substantive decision making related to a category of documents changes, there does not have to be a systematic re-review of the seed set and other training sets. Essentially, mid-course changes are easier to deal with. Furthermore, it is sometimes lost on people that a linear review of a (sometimes significant) number of documents may still be required after the predictive coding software has been trained. It is important to think through and analyze these issues at the outset as much as possible.
3. Who will “train” the predictive coding software?
Once the decision has been made to proceed with predictive coding, the next question is which lucky person or persons get the honor of reviewing documents in the “seed set” (and subsequent training sets) to train the software. Since the decisions used to code the seed set (and subsequent training sets) will be used to make relevance decisions over a much larger set of documents, thus multiplying the effect of each coding decision, it is important that the individuals coding these training sets be “experts” in the relevant facts. The individual(s) chosen should have sufficient case knowledge, understanding of the substantive law, and understanding of the seed set review process that is in place (for example, metadata and privilege should generally be disregarded in making relevance decisions at this stage). Generally speaking, each “expert” should be an experienced attorney who has been involved in the early case assessment phases of the case and is not seeing relevant documents for the first time during this round of review. There is some debate as to whether it is better to train the predictive coding software with a single “expert” or multiple “experts.” If multiple “experts” are used, it is a best practice to have them work together closely and communicate regularly to ensure consistency in the seed set coding.
4. Should I use keywords or random sampling to develop the seed set?
When deciding what will be in the seed sets that get reviewed by each “expert” to train the system, a decision will need to be made as to whether the seed set contains a random sample of documents or a subset derived from keyword searching. Again, there are different schools of thought as to which methodology is better. On one hand, using a random sample ensures that the data reviewed to train the system is not biased by the keywords that are chosen. On the other hand, using keywords can help focus the seed set so that some of the most important and relevant documents are used in training the software. The use of keywords certainly introduces a bias, but assuming the keywords are reasonable and based on sound legal judgment, there is an argument that this bias is harmless. (Additionally, some have argued that even data that has not been keyword searched is already biased, since only certain types of data, from certain custodians, during certain date ranges, would have been collected; some selection criteria were already applied.) I will leave it to others to advocate their preferred approach, but a decision will have to be made one way or the other based on the specific qualities of each dataset.
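For readers who want a concrete picture of the difference, the following is a minimal sketch, in Python, of the two seed-set strategies described above. The document list, keyword, and sample size are hypothetical placeholders, not features of any particular matter or platform; in practice, the review software performs this selection.

```python
import random

def random_seed_set(documents, sample_size, seed=1):
    """Draw a simple random sample of documents to serve as the seed set."""
    rng = random.Random(seed)
    return rng.sample(documents, min(sample_size, len(documents)))

def keyword_seed_set(documents, keywords, sample_size, seed=1):
    """Limit the pool to keyword hits, then sample the seed set from that subset."""
    hits = [d for d in documents
            if any(k.lower() in d["text"].lower() for k in keywords)]
    rng = random.Random(seed)
    return rng.sample(hits, min(sample_size, len(hits)))

# Hypothetical example: 1,000 toy documents and a single illustrative keyword.
docs = [{"id": i, "text": f"document {i} discussing pricing models"} for i in range(1000)]
print(len(random_seed_set(docs, 100)))                # 100 documents, unbiased by keywords
print(len(keyword_seed_set(docs, ["pricing"], 100)))  # 100 documents, focused by keywords
```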
5. How do my numbers look?
Once the review has commenced and the training sets have been reviewed, decisions will need to be made about what you will need to show to demonstrate the validity of the predictive coding results. This is where the statistics come in. First, you need to make sure that the size of the seed set was large enough to give you a sufficient confidence level and margin of error. I like to use a 95% confidence level with a +/- 3% margin of error, which is fairly standard, but other parameters can be used. Then, once the software’s coding of the test sets is compared against the expert’s decisions, there are three primary statistical measures relevant to evaluating the validity of the results: precision, recall, and the F-measure. Precision is the fraction of the documents retrieved as relevant that actually are relevant. Recall is the fraction of all relevant documents that were retrieved. The F-measure is the harmonic mean of the system’s precision and recall. Tools related to these metrics are generally built into predictive coding software. While it is important to have a familiarity with these metrics, practitioners do not need to become part-time statisticians to conduct a predictive coding review. It is probably impossible to suggest a universally appropriate standard for these metrics, and for that reason, it is definitely worth working with someone (often a vendor or other trained litigation support professional) who understands them well. That input will be essential for making the legal determination as to whether the results of a predictive coding review pass muster.
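To make the arithmetic behind these metrics concrete, here is a minimal sketch in Python. The document counts are hypothetical, and the sample-size formula assumes the standard calculation for a proportion at a 95% confidence level with the conservative p = 0.5 assumption and no finite-population correction; vendors and statisticians may use different methods.

```python
import math

def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Precision, recall, and the F-measure (harmonic mean of the first two)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def sample_size(margin_of_error=0.03, z=1.96, p=0.5):
    """Sample size for estimating a proportion at ~95% confidence (z = 1.96)."""
    return math.ceil((z ** 2) * p * (1 - p) / margin_of_error ** 2)

# Hypothetical example: the software and the expert agree that 800 documents are
# relevant, the software flags 200 the expert rejected, and it misses 100.
print(precision_recall_f1(800, 200, 100))  # ~ (0.80, 0.89, 0.84)
print(sample_size())                       # 1068 documents for 95% +/- 3%
```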