A benchmark of expert-level academic questions to assess AI capabilities

A student in dark clothing leans over a desk inside a wooden-framed window or booth area, focused on reading or writing documents spread before them, with curtains visible in the background.
Image Credit: Photo by Yanhao Fang on Unsplash (Source · License)

About This Article

This is an AI-generated summary of a research paper. The original authors did not write or review this article. See full disclosure ↓

Nature · 2026-01-28 · View original paper →

Overview

Large language models have rapidly achieved high performance on existing benchmarks, with accuracy exceeding 90% on widely used assessments such as Measuring Massive Multitask Language Understanding (MMLU). This saturation limits the capacity to differentiate between state-of-the-art systems and to measure continued progress in LLM capabilities. To address this limitation, researchers developed Humanity's Last Exam (HLE), a multi-modal benchmark positioned at the frontier of human knowledge and designed to evaluate expert-level academic competence across a broad range of disciplines. The benchmark comprises 2,500 questions spanning dozens of subjects, including mathematics, the humanities, and the natural sciences. Each question has an unambiguous solution that can be verified but not readily obtained through simple internet retrieval, establishing a higher bar for assessment than previous benchmarks.
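As a rough illustration of the benchmark's scale and structure, the sketch below loads the released question set and tallies answer formats. The Hugging Face dataset identifier `cais/hle`, the `test` split, and the field names (`question`, `answer`, `answer_type`) are assumptions about the public release rather than details stated in this summary.

```python
# Minimal sketch: load and inspect the HLE question set.
# Assumption: the public Hugging Face release is "cais/hle" with a "test"
# split and fields such as "question", "answer", and "answer_type";
# these identifiers are illustrative, not quoted from the paper.
from collections import Counter

from datasets import load_dataset  # pip install datasets

hle = load_dataset("cais/hle", split="test")

print(f"Total questions: {len(hle)}")   # roughly 2,500 per the paper
print(Counter(hle["answer_type"]))      # e.g. exact-match vs. multiple-choice

sample = hle[0]
print(sample["question"][:200])         # question text; multimodal items also include an image
```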

Methods and approach

HLE was developed through a global collaboration of subject-matter experts who contributed questions across multiple academic domains. The benchmark employs both multiple-choice and short-answer question formats, all designed to be suitable for automated grading while maintaining expert-level difficulty. Questions were constructed to have known, unambiguous solutions that can be verified easily but that require deep domain expertise rather than simple information retrieval. The Center for AI Safety and Scale AI jointly designed the dataset premise and collection pipeline, operated the data-collection platform, and provided funding and inference infrastructure for the LLMs. The HLE Contributors Consortium comprised the experts who submitted questions, contributed to dataset refinement, and assisted with evaluations. The benchmark was structured so that questions sit at the frontier of human knowledge while remaining closed-ended and objectively gradable.
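Because every item is closed-ended, grading can be automated. The snippet below is a minimal, hypothetical grader for the two formats (letter match for multiple choice, normalized exact match for short answers); it is not the authors' grading pipeline, and the `answer_type` labels are assumed for illustration.

```python
# Illustrative grader for closed-ended questions (not the authors' pipeline).
# Assumption: each item is labeled either "multipleChoice" or "exactMatch".
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace for a lenient match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def grade(prediction: str, reference: str, answer_type: str) -> bool:
    if answer_type == "multipleChoice":
        # Compare the chosen option letter (e.g. "B") case-insensitively.
        return prediction.strip().upper()[:1] == reference.strip().upper()[:1]
    # Short-answer items: normalized exact match against the reference answer.
    return normalize(prediction) == normalize(reference)


print(grade("B", "b", "multipleChoice"))   # True
print(grade(" 42. ", "42", "exactMatch"))  # True
```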

Results

State-of-the-art LLMs demonstrated low accuracy on HLE, revealing a substantial gap between current model capabilities and expert human performance on closed-ended academic questions. The models also exhibited poor calibration on the benchmark, indicating limitations not only in raw performance but also in confidence estimation. These results contrast sharply with the high accuracy rates achieved on existing popular benchmarks, confirming that HLE successfully establishes a more challenging evaluation standard. The low performance across multiple state-of-the-art systems suggests that current LLM architectures and training approaches have not yet reached expert-level competence on specialized academic knowledge, despite their strong performance on broader, less demanding assessments.
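Calibration here means how closely a model's stated confidence tracks its actual accuracy. One common way to quantify it is a binned root-mean-square calibration error, sketched below under the assumption that each answer comes with a confidence in [0, 1]; the binning scheme is illustrative and not necessarily the paper's exact protocol.

```python
# Sketch of binned RMS calibration error: group answers by stated confidence,
# then compare mean confidence with observed accuracy in each bin.
import numpy as np


def rms_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """confidences in [0, 1]; correct is a same-length array of 0/1 outcomes."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(confidences), 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        if i == n_bins - 1:
            # Final bin is closed on the right so that confidence 1.0 is counted.
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.sum() == 0:
            continue
        gap = confidences[mask].mean() - correct[mask].mean()
        err += (mask.sum() / total) * gap ** 2
    return float(np.sqrt(err))


# A model that is always 90% confident but only 30% correct is badly calibrated.
print(rms_calibration_error([0.9] * 10, [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]))  # ≈ 0.6
```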

Implications

The introduction of HLE addresses a critical gap in AI evaluation by providing a benchmark that can meaningfully differentiate between state-of-the-art systems and track progress toward expert-level capabilities. The marked performance gap identified between current LLMs and the expert human frontier on closed-ended academic questions has implications for both research direction and policy formulation. By establishing a higher difficulty threshold, HLE enables more informed measurement of genuine advances in LLM capabilities rather than incremental improvements on saturated benchmarks. The public release of the benchmark supports ongoing research efforts to develop more capable AI systems and provides policymakers with clearer metrics for understanding the current state and limitations of LLM technology. The benchmark's design principles—emphasizing verifiable solutions at the frontier of human knowledge—may inform future evaluation methodologies as AI capabilities continue to evolve.

Disclosure

  • Research title: A benchmark of expert-level academic questions to assess AI capabilities
  • Authors: Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Paul Vicol, Mantas Mazeika, Dan Hendrycks, Scale AI, Ziwen Han
  • Publication date: 2026-01-28
  • DOI: https://doi.org/10.1038/s41586-025-09962-4
  • Image credit: Photo by Yanhao Fang on Unsplash (Source · License)
  • Disclosure: This post was generated by artificial intelligence. The original authors did not write or review this post.