Tuesday, June 4, 2024
GPT-4 Didn't Ace The Bar Exam After All, MIT Research Suggests
Live Science, GPT-4 Didn't Ace the Bar Exam After All, MIT Research Suggests — It Didn't Even Break the 70th Percentile:
GPT-4 didn't actually score in the top 10% on the bar exam after all, new research suggests. Last year, claims that OpenAI's GPT-4 model beat 90% of trainee lawyers on the bar exam generated a flurry of media hype, but those claims were likely overstated.
OpenAI, the company behind the large language model (LLM) that powers its chatbot ChatGPT, made the claim in March last year, and the announcement sent shock waves around the web and the legal profession.
Now, a new study has revealed that the much-hyped 90th-percentile figure was actually skewed toward repeat test-takers who had already failed the exam one or more times — a much lower-scoring group than those who generally take the test. The researcher published his findings March 30 in the journal Artificial Intelligence and Law.
"It seems the most accurate comparison would be against first-time test takers, or, to the extent that you think the percentile should reflect GPT-4's performance as compared to an actual lawyer, the most accurate comparison would be to those who pass the exam," study author Eric Martínez, a doctoral student at MIT's Department of Brain and Cognitive Sciences, said at a New York State Bar Association continuing legal education course.
Boing Boing, ChatGPT Not That Great at Bar Exam After All:
When GPT-4 was released, one of the hype lines was that it passed the bar exam at the 90th percentile. "GPT-4 Passes the Bar Exam: What That Means for Artificial Intelligence Tools in the Legal Profession," explainered Stanford Law School. The claim is in the first paragraph of OpenAI's announcement.
Efforts have been made to replicate the claim, and the results aren't so impressive.
Eric Martínez (MIT; Google Scholar), Re-evaluating GPT-4’s Bar Exam Performance:
Perhaps the most widely touted of GPT-4’s at-launch, zero-shot capabilities has been its reported 90th-percentile performance on the Uniform Bar Exam. This paper begins by investigating the methodological challenges in documenting and verifying the 90th-percentile claim, presenting four sets of findings that indicate that OpenAI’s estimates of GPT-4’s UBE percentile are overinflated.
First, although GPT-4’s UBE score nears the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates are heavily skewed towards repeat test-takers who failed the July administration and score significantly lower than the general test-taking population. Second, data from a recent July administration of the same exam suggests GPT-4’s overall UBE percentile was below the 69th percentile, and 48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be 62nd percentile, including 42nd percentile on essays. Fourth, when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to 48th percentile overall, and 15th percentile on essays.
In addition to investigating the validity of the percentile claim, the paper also investigates the validity of GPT-4’s reported scaled UBE score of 298. The paper successfully replicates the MBE score, but highlights several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the reported essay score.
Finally, the paper investigates the effect of different hyperparameter combinations on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings, and a significant effect of few-shot chain-of-thought prompting over basic zero-shot prompting. Taken together, these findings carry timely insights for the desirability and feasibility of outsourcing legally relevant tasks to AI models, as well as for the importance for AI developers to implement rigorous and transparent capabilities evaluations to help secure safe and trustworthy AI.
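The spread between these estimates (90th vs. 62nd vs. 48th percentile) comes down to which reference population's score distribution the same scaled score is compared against. A minimal sketch of that dependence, assuming, purely for illustration, normally distributed scores with hypothetical means and standard deviations (these are not actual NCBE statistics):

```python
import math

def percentile_from_normal(score: float, mean: float, sd: float) -> float:
    """Percentile of `score` assuming scores are normally distributed
    with the given mean and standard deviation."""
    return 100 * 0.5 * (1 + math.erf((score - mean) / (sd * math.sqrt(2))))

score = 298  # GPT-4's reported scaled UBE score

# Hypothetical population parameters, for illustration only:
populations = [
    ("repeat takers (lower-scoring pool)", 255, 25),
    ("first-time takers", 270, 25),
    ("passers only", 282, 18),
]

for label, mean, sd in populations:
    pct = percentile_from_normal(score, mean, sd)
    print(f"{label}: {pct:.0f}th percentile")
```

The same score of 298 lands at a progressively lower percentile as the comparison group's scores rise, which is the mechanism behind the paper's finding that the headline 90th-percentile figure shrinks when benchmarked against first-time takers or licensed attorneys.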
Prior TaxProf Blog coverage:
- Josh Blackman (South Texas), GPT Will Soon Be Able To Pass The Multistate Bar Exam (Jan. 5, 2023)
- Daniel Martin Katz (Chicago Kent) et al., GPT-4 Beats 90% Of Aspiring Lawyers On The Bar Exam (Mar. 17, 2023)
https://taxprof.typepad.com/taxprof_blog/2024/06/gpt-4-didnt-ace-the-bar-exam-after-all-mit-research-suggests.html