AI Summary of Peer-Reviewed Research

This page presents an AI-generated summary of a published research paper. The original authors did not write or review this article. See the Disclosure section below.

Publishing process signals: STRONG (reflects the venue and review process).

Some medical chatbot answers may be unsafe

Research area: Medicine; AI in Service Interactions; Artificial Intelligence in Healthcare and Education

What the study found

Publicly available large language model (LLM) chatbots sometimes gave unsafe answers to patient-posed medical questions. The study found statistically significant differences across the four chatbots tested, with problematic responses ranging from 21.6% to 43.2%.

Why the authors say this matters

The authors conclude that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and they say further work is needed to improve the clinical safety of these tools. Here, LLM chatbots are tools that generate text responses using models trained on large amounts of data.

What the researchers tested

A physician-led red-teaming study evaluated four publicly available chatbots: Claude, Gemini, GPT-4o, and Llama-3.0/3.1-70B. The team used a new dataset called HealthAdvice and a framework for quantitative and qualitative analysis to assess 888 chatbot responses to 222 patient-posed advice-seeking medical questions across internal medicine, women's health, and pediatrics.

What worked and what didn't

Claude had the lowest rate of problematic responses at 21.6%, while Llama had the highest at 43.2%. Unsafe responses ranged from 5% for Claude to 13% for GPT-4o and Llama, and the qualitative review found responses that could potentially lead to serious patient harm.
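As a rough arithmetic check on the figures above, the sketch below converts the reported percentages into approximate response counts. It assumes each of the four chatbots answered every one of the 222 questions exactly once (consistent with 888 = 4 × 222, but not stated outright in this summary). The implied counts are inferred from rounded percentages and are not figures reported by the study; the function names are illustrative only.

```python
# Figures taken from the summary text.
N_QUESTIONS = 222
N_CHATBOTS = 4
N_RESPONSES = 888

# Reported problematic-response rates in percent (only the extremes are given).
PROBLEMATIC_PCT = {"Claude": 21.6, "Llama": 43.2}


def responses_per_chatbot(n_responses: int, n_chatbots: int) -> int:
    """Assumption: responses are split evenly across the chatbots."""
    per_bot, remainder = divmod(n_responses, n_chatbots)
    if remainder:
        raise ValueError("responses do not divide evenly across chatbots")
    return per_bot


def approx_count(pct: float, n: int) -> int:
    """Nearest whole number of responses implied by a rounded percentage."""
    return round(pct / 100 * n)


per_bot = responses_per_chatbot(N_RESPONSES, N_CHATBOTS)

# Inferred counts, not reported data.
implied = {name: approx_count(pct, per_bot) for name, pct in PROBLEMATIC_PCT.items()}

for name, count in implied.items():
    # Round-trip the inferred count back to a percentage for comparison.
    print(f"{name}: about {count} of {per_bot} responses ({count / per_bot:.1%})")
```

Under that assumption, the reported rates correspond to roughly 48 and 96 problematic responses out of 222 for Claude and Llama, so the highest rate is about double the lowest.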

What to keep in mind

The abstract does not describe detailed limitations beyond the scope of the tested dataset and topics. The findings apply to the specific chatbots, questions, and evaluation framework used in this study.

Key points

  • The study found unsafe answers to medical questions in responses from publicly available chatbots.
  • Problematic response rates ranged from 21.6% for Claude to 43.2% for Llama.
  • Unsafe responses ranged from 5% for Claude to 13% for GPT-4o and Llama.
  • The analysis covered 888 responses to 222 patient-posed questions on primary care topics.
  • The authors say millions of patients could be receiving unsafe medical advice.

Disclosure

Research title: Some medical chatbot answers may be unsafe
Publication date: 2026-02-13
OpenAlex record: View
AI provenance: AI provenance information is not available for this post.