Tuesday, December 5, 2023
Andrew Blair-Stanek (Maryland; Google Scholar), Nils Holzenberger (Institut Polytechnique de Paris) & Benjamin Van Durme (Johns Hopkins; Google Scholar), BLT: Can Large Language Models Handle Basic Legal Text?:
We find that the best publicly available LLMs like GPT-4 and PaLM 2 currently perform poorly at basic text handling required of lawyers or paralegals, such as looking up the text at a line of a witness deposition or at a subsection of a contract. We introduce a benchmark to quantify this poor performance, which casts into doubt LLMs' current reliability as-is for legal practice. Finetuning for these tasks brings an older LLM to near-perfect performance on our test set and also raises performance on a related legal task. This stark result highlights the need for more domain expertise in LLM training.
We demonstrate that the best currently available LLMs perform very poorly at many basic legal text-handling tasks. The chief innovation officer at the large international law firm Baker & McKenzie observed to the New York Times of LLMs, "At its best, the technology seems like a very smart paralegal." (Lohr, 2023). It might be more accurate to say that LLMs are like very sloppy paralegals. We find poor performance from GPT-3.5-turbo and PaLM 2 on our smallest test set, BLT-4k, and poor performance from even GPT-4 on portions of BLT-4k, like finding the text on one or two pages of a deposition transcript. But fine-tuning on BLT-4k's training set brings the performance of GPT-3.5-turbo up to the expected human level.
While we are focused on law, the BLT tasks are low-level enough that we would expect these findings to be relevant to anyone, regardless of domain. Moreover, this poor out-of-the-box performance by advanced LLMs shows the importance of consulting domain experts when training future LLMs.
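To make the task concrete, here is a minimal sketch of what a line-lookup example in the style described above might look like. This is an illustrative reconstruction, not the authors' actual BLT benchmark code: the transcript text is random filler, and the function names (`make_line_lookup_task`, `score`) are hypothetical.

```python
import random
import string

def make_line_lookup_task(num_lines=40, seed=0):
    """Build one synthetic line-lookup example: a line-numbered
    transcript plus a question asking for the text on one line.
    (Hypothetical sketch; the real BLT tasks use deposition and
    contract text.)"""
    rng = random.Random(seed)
    lines = []
    for _ in range(num_lines):
        words = [
            "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 8)))
            for _ in range(rng.randint(4, 9))
        ]
        lines.append(" ".join(words))
    target = rng.randint(1, num_lines)
    numbered = "\n".join(f"{i:>3}  {t}" for i, t in enumerate(lines, 1))
    prompt = (
        f"{numbered}\n\n"
        f"What is the exact text on line {target} of the transcript above?"
    )
    return prompt, lines[target - 1]

def score(model_output: str, gold: str) -> bool:
    # Exact match after whitespace normalization -- a model either
    # retrieves the right line verbatim or it fails.
    return " ".join(model_output.split()) == " ".join(gold.split())

prompt, answer = make_line_lookup_task()
# A system that copies the correct line verbatim scores as correct:
assert score(answer, answer)
```

The point of such a task is that it requires no legal reasoning at all, only faithful retrieval of numbered text, which is why failures on it are so striking.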