Thursday, May 25, 2023
GPT-4’s Law School Grades: Con Law C, Crim C-, Law & Econ C, Partnership Tax B, Property B-, Tax B
Andrew Blair-Stanek (Maryland; Google Scholar), Anne-Marie Carstens (Maryland), Daniel S. Goldberg (Maryland), Mark Graber (Maryland), David C. Gray (Maryland) & Maxwell L. Stearns (Maryland; Google Scholar), GPT-4’s Law School Grades: Con Law C, Crim C-, Law & Econ C, Partnership Tax B, Property B-, Tax B:
GPT-4 performs vastly better than ChatGPT or GPT-3.5 on legal tasks like the bar exam and statutory reasoning. To test GPT-4’s abilities, we ran it on our final exams this semester and graded its output alongside students’ exams. We found that it produced smoothly written answers that failed to spot many important issues, much like a bright student who had neither attended class often, nor thought deeply about the material. It uniformly performed below average—in every course. We provide observations that may help law professors detect students who cheat on exams using GPT-4.
GPT-4 performed well below average on all our exams this semester. In at least one instance, its failure to properly analyze an issue would have likely been malpractice that could have resulted in the client going to jail. This low performance comes despite GPT-4 having been trained on a huge web corpus that likely included every published U.S. case and statute, as well as much legal commentary.
We find that GPT-4 performs decently at multiple choice questions, across areas as diverse as Constitutional Law, Property, and Tax. This is consistent with its good performance on standardized multiple choice tests like the LSAT, SAT, and GRE. Given that GPT-4’s huge training corpus likely contains many multiple choice questions, it may be an expert at gaming them. Professors worried about GPT-4-based cheating might consider moving away from multiple choice, particularly since a one-letter answer is much harder than prose to check for GPT-4’s fingerprints.
On written questions like issue spotters, GPT-4 misses many obvious issues and lacks depth of analysis. Even when given a universe of authorities (e.g., cases) to draw from, it does not fully utilize them. Yet it has occasional flashes of brilliance like spotting issues missed by most students or analyzing remedies for claims it identifies. It often refers to doctrines by alternative names not used in class, as with “vested rights” rather than “nonconforming uses.” It sometimes spots entirely valid issues—on topics not actually covered in the course. GPT-4 produces unusually smooth and organized prose, often with helpful headers, numbering, and summaries. Hopefully these tendencies will help professors spot answers written by GPT-4 or similar models.
GPT-4 got its best grades this semester in the two tax-law courses. There are several possible explanations. One, both exams were heavy on multiple choice, one of GPT-4’s stronger areas. Two, the curving in tax classes may be more generous. Three, OpenAI’s use of a tax-law example during its 22-minute GPT-4 livestream kickoff suggests that some of GPT-4’s training may have been optimized to handle tax law.
In terms of future work, some of us and some of our colleagues may re-run this same experiment with the latest GPT model on our exams in the fall semester. We may also experiment with different parameters, such as seeing whether it gets better (or worse) grades when the model uses a higher “temperature,” making the text more random and creative.
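For readers unfamiliar with the “temperature” parameter, the sketch below illustrates the standard textbook idea: the model's raw scores for candidate next words are divided by the temperature before being turned into probabilities. This is not a description of how GPT-4 is implemented internally, which is not public. The function names, candidate phrases, and scores are all hypothetical and chosen only for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng=random):
    """Pick one candidate at random, weighted by the scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidates and scores, for illustration only.
tokens = ["vested rights", "nonconforming uses", "easement"]
logits = [2.0, 1.0, 0.1]

low = softmax_with_temperature(logits, 0.2)   # sharply favors the top score
high = softmax_with_temperature(logits, 2.0)  # spreads probability more evenly

print("low temperature: ", [round(p, 3) for p in low])
print("high temperature:", [round(p, 3) for p in high])
print("one sample at high temperature:", sample_token(tokens, logits, 2.0))
```

At a low temperature nearly all of the probability lands on the highest-scoring candidate, so output is more predictable. At a higher temperature the lower-scoring candidates are chosen more often, which is the sense in which the text becomes more random.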