Breaking Language Barriers with Generative AI: How Foley & Lardner Conducted Multilingual Document Review with Relativity aiR for Review

Foley & Lardner LLP (Foley) and Relativity began experimenting with GPT products in 2023, work that evolved into the “aiR” suite of GPT-based tools available today. Foley’s participation and success in those early experiments encouraged its legal teams to test novel uses of aiR and see how it would perform.
One such experiment was a multilingual (Spanish and English) internal investigation. Traditionally, a foreign-language review requires additional time and money for fluent reviewers and translations. Only one of the team strategists driving the scope of the investigation was fluent in Spanish. It was the ideal situation to test aiR’s ability not only to translate but to analyze and understand the language well enough to generate support and citations for its recommendations.
The English-speaking case strategist drafted an English-only prompt in aiR for Review to identify five core issues. The results were extraordinary: aiR identified the issues, understood the Spanish-language content, and provided English-language output. Validation tests were performed on the accuracy of issue detection at both the record and issue level, and Foley’s multilingual attorneys validated the citations to ensure they supported the analysis. The cost and time savings compared to generating record-level issue analysis and translations before aiR cannot be overstated.
The need for document translation, along with the time and expense it entails, was eliminated. This allowed all case strategists to see English-language results overnight and quickly move on to counseling the client.
The Importance of Prompt Iteration
Establishing appropriate prompt criteria, the inputs that give aiR for Review the context it needs to evaluate data, is essential to successful and accurate output. The prompt criteria used for this analysis were developed iteratively. This approach allowed the review team to evaluate initial prompt results and adjust the criteria, either to correctly categorize documents based on the initial understanding of the matter or to account for new information discovered while reviewing the documents.
First, the initial prompt criteria were based on instructions the case strategists had provided to the bilingual reviewers. Some Spanish-language search-term review had been done up to that point, and a handful of those materials were used for prompt testing. Could aiR find what Foley reviewers already knew?
These initial criteria were tested across 50 previously identified “hot” documents with relevant issue tags to determine whether aiR for Review could identify the same issues across the sample set and appropriately provide Spanish-language citations. The results were QC’d, and reviewers provided feedback on documents that aiR identified as “borderline.” Based on the QC feedback, the prompt was revised with additional instructions as to how these borderline documents should be categorized. Further revision of the prompt, informed by additional human review of relevant documents, shifted recommendations from borderline to relevant when the prompt was tested against a random sample of 100 new, unreviewed documents.
The workflow Foley established was to conduct human review of any borderline document.
This process demonstrates the importance of an iterative approach to developing prompt criteria. By testing an initial version of the prompt on a small sample, the Foley team was able to evaluate aiR’s interpretations and identify the inputs needed to produce more accurate predictions. While reviewing the sample, the reviewers were able to further enhance the information provided in the prompt. These adjustments improved results and established the confidence needed to apply the technology across a wider data set.
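For readers who think in terms of process diagrams, the general shape of this iterative workflow can be sketched in code. The sketch below is purely illustrative and does not use Relativity’s actual API: the callables passed in (running a review, collecting reviewer QC feedback, revising the prompt criteria) are hypothetical stand-ins for the manual and in-product steps the Foley team performed.

```python
import random
from typing import Callable, Sequence

def refine_prompt(
    prompt: str,
    documents: Sequence[str],
    run_review: Callable[[str, Sequence[str]], list[dict]],
    collect_feedback: Callable[[list[dict]], list[str]],
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 3,
    sample_size: int = 50,
) -> str:
    """Illustrative sketch of an iterative prompt-refinement loop.

    The callables stand in for manual and in-product steps (running aiR for
    Review on a sample, gathering reviewer QC feedback, editing the prompt
    criteria); they are hypothetical, not Relativity API calls.
    """
    for _ in range(max_rounds):
        # Test the current prompt criteria against a small document sample.
        sample = random.sample(list(documents), min(sample_size, len(documents)))
        results = run_review(prompt, sample)

        # Human reviewers QC the output, focusing on "borderline" recommendations.
        borderline = [r for r in results if r.get("category") == "borderline"]
        feedback = collect_feedback(borderline)

        if not feedback:
            break  # no corrections needed; the prompt criteria are stable

        # Fold the reviewer feedback back into the prompt criteria.
        prompt = revise(prompt, feedback)

    return prompt
```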
Putting aiR into Action
After the prompt criteria were finalized, aiR for Review was run across a set of unreviewed material that hit on prioritized search terms. In total, 2,292 records were analyzed, of which 589 were determined to be relevant to the issues or borderline. An additional 385 documents could not be analyzed by aiR for Review due to format limitations. To assess the output, human reviewers QC’d the relevant, borderline, and unanalyzed records (974 documents in total).
The results were impressive: 6% (55) of the documents received reviewer feedback, the vast majority of which were documents aiR had identified as “borderline.” Only two documents were assessed incorrectly; in both instances, aiR was overinclusive, treating a document as relevant when it was not.
Overall, less than 1% of aiR’s recommendations were overturned by human reviewers in the QC process. Based on these very strong results, no additional adjustments were made to the prompt criteria, and aiR was used on additional and larger sets of documents.
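As a quick arithmetic check of the figures above (assuming the overturn rate is measured against the 589 relevant or borderline recommendations, which the article does not state explicitly), the numbers line up as follows:

```python
# Rough check of the QC figures reported in this section.
analyzed = 2_292                 # records run through aiR for Review
relevant_or_borderline = 589     # recommended as relevant or borderline
unanalyzable = 385               # excluded due to format limitations

qcd = relevant_or_borderline + unanalyzable
print(qcd)                       # 974 documents QC'd by human reviewers

feedback_docs = 55
print(round(feedback_docs / qcd * 100))          # ~6% of QC'd documents drew feedback

overturned = 2
# Assumed denominator: the 589 relevant/borderline recommendations.
print(round(overturned / relevant_or_borderline * 100, 2))  # ~0.34%, i.e. <1%
```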
Since the subjective review of aiR for Review’s performance was positive, Foley enlisted Relativity’s data scientists to confirm those results objectively. Foley’s bilingual subject-matter expert (SME) for the project reviewed aiR for Review’s results from sample sets of documents designed by Relativity’s data scientists. The SME was instructed to assess both (1) aiR for Review’s issue detection and (2) the citation support for its conclusions. At the conclusion of the SME review, issue and citation validation both yielded a <1% error rate.
The investigation was ongoing at the time this article was published, and based on these results, the Foley team has continued to use aiR for Review to accelerate the review of pertinent documents.
Breaking Language Barriers with Generative AI
aiR for Review demonstrated a remarkable ability to analyze Spanish material and identify Spanish citations, all while quickly and accurately providing robust English-language reasoning for its decisions. When a multilingual partner reviewed the rationales and citations for accuracy, over 99% of the rationales were determined to be correctly interpreted, and 89% of the citations were determined to correctly support the analysis.
This capability introduces tremendous opportunity for efficiency in multilingual cases. Beyond faster and more accurate analysis, aiR for Review can help reviewers summarize and report on documents whose source language is not understood by seasoned attorneys, subject-matter experts, or key stakeholders. In more complex cases that require subject-matter experts, firms can hire individuals for their expertise without worrying about language barriers, then use aiR for Review’s reasoning and rationale to help those experts understand the nature of the matter even if they are not familiar with the language of the source documents.
While we are just beginning to explore generative AI’s ability to work across languages, the initial results show great promise when it comes to using generative AI to drive time and cost savings for multilingual document review.
If you have questions about aiR for Review or Foley’s use of Artificial Intelligence, contact the authors or your Foley & Lardner attorney.