What the study found
GPT-o1 outperformed Llama-3.2-8b-instruct on clinically grounded causal reasoning tasks about laboratory test interpretation, showing higher overall discriminative performance, sensitivity, specificity, and expert reasoning ratings.
Why the authors say this matters
The authors conclude that GPT-o1 offers more consistent causal reasoning, but they caution that further refinement is needed before high-stakes clinical deployment. They also suggest that this type of evaluation may help assess how well large language models handle clinical causal reasoning.
What the researchers tested
The researchers evaluated causal reasoning in large language models using 99 laboratory test scenarios mapped to Pearl's Ladder of Causation, which includes association, intervention, and counterfactual reasoning. They focused on common tests such as hemoglobin A1c (HbA1c), creatinine, and vitamin D, paired with factors such as age, gender, obesity, and smoking. Two models were tested, GPT-o1 and Llama-3.2-8b-instruct, and responses were rated by four medically trained human experts.
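To make the three rungs of Pearl's Ladder concrete, here is a minimal sketch of how a lab-test scenario might be phrased at each level. The example questions are hypothetical illustrations, not items taken from the study's 99 scenarios.

```python
# Hypothetical examples (not from the study) of one lab test (HbA1c)
# phrased at each rung of Pearl's Ladder of Causation.
ladder_examples = {
    "association":    "Is elevated HbA1c associated with obesity?",
    "intervention":   "If a patient stops smoking, how would their HbA1c change?",
    "counterfactual": "Had this patient not been obese, would their HbA1c still be elevated?",
}

for rung, question in ladder_examples.items():
    print(f"{rung}: {question}")
```

Each rung asks progressively harder questions: observing a correlation, predicting the effect of an action, and reasoning about an outcome under a hypothetical change to the past.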
What worked and what didn't
GPT-o1 had an overall AUROC of 0.80 ± 0.12, compared with 0.73 ± 0.15 for Llama-3.2-8b-instruct. It also scored higher on association, intervention, and counterfactual reasoning, and had better sensitivity and specificity. Both models did best on intervention questions and worst on counterfactual questions, especially altered-outcome scenarios.
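For readers unfamiliar with AUROC, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal pure-Python sketch of that pairwise definition (with illustrative scores, not the study's data):

```python
from itertools import product

def auroc(scores_pos, scores_neg):
    """AUROC = P(score of a random positive > score of a random negative),
    counting ties as 0.5."""
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model-confidence scores for positive and negative cases.
pos = [0.9, 0.8, 0.7, 0.6]
neg = [0.5, 0.4, 0.7]
print(round(auroc(pos, neg), 3))  # → 0.875
```

A score of 0.5 means chance-level discrimination and 1.0 means perfect ranking, so the reported 0.80 vs 0.73 gap reflects how much more reliably GPT-o1 separated correct from incorrect answers.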
What to keep in mind
The abstract does not describe detailed limitations beyond stating that further refinement is needed before high-stakes clinical deployment. The findings are limited to the 99 scenarios and the two tested models.
Key points
- GPT-o1 outperformed Llama-3.2-8b-instruct in overall discriminative performance on clinical lab-test causal reasoning tasks.
- The study used 99 scenarios based on Pearl's Ladder of Causation: association, intervention, and counterfactual reasoning.
- Four medically trained human experts rated the models' responses.
- Both models performed best on intervention questions and worst on counterfactual questions, especially altered-outcome scenarios.
- The authors say further refinement is needed before high-stakes clinical deployment.
Disclosure
- Research title:
- GPT-o1 outperformed Llama on clinical causal reasoning tasks
- Authors:
- Balu Bhasuran, Mattia Prosperi, Karim Hanna, John Petrilli, Caretia JeLayne Washington, Zhe He
- Institutions:
- Florida State University, University of Florida, University of South Florida, Florida A&M University – Florida State University College of Engineering
- Publication date:
- 2026-04-23
- Image credit:
- Photo by www.kaboompics.com on Pexels · Pexels License