A benchmark of expert-level academic questions to assess AI capabilities

A student in dark clothing leans over a desk inside a wooden-framed window or booth area, focused on reading or writing documents spread before them, with curtains visible in the background.
Image Credit: Photo by Yanhao Fang on Unsplash (Source · License)

About This Article

This is an AI-generated summary of a research paper. The original authors did not write or review this article. See full disclosure ↓

Nature · 2026-01-28 · View original paper →

Overview

Large language models have rapidly achieved high performance on existing benchmarks, with accuracy exceeding 90% on widely used assessments such as Measuring Massive Multitask Language Understanding (MMLU). This saturation limits the capacity to differentiate between state-of-the-art systems and to measure continued progress in LLM capabilities. To address this limitation, researchers developed Humanity's Last Exam (HLE), a multi-modal benchmark positioned at the frontier of human knowledge and designed to evaluate expert-level academic competence across a broad range of disciplines. The benchmark comprises 2,500 questions spanning dozens of subjects, including mathematics, the humanities, and the natural sciences. Each question has an unambiguous solution that can be verified but not readily obtained through simple internet retrieval, establishing a higher bar for assessment than previous benchmarks.
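As a rough illustration of the benchmark's scale and structure, the sketch below loads the released question set and tallies answer formats. The Hugging Face dataset identifier `cais/hle`, the `test` split, and the field names (`question`, `answer`, `answer_type`) are assumptions about the public release rather than details stated in this summary.

```python
# Minimal sketch: load and inspect the HLE question set.
# Assumption: the public Hugging Face release is "cais/hle" with a "test"
# split and fields such as "question", "answer", and "answer_type";
# these identifiers are illustrative, not quoted from the paper.
from collections import Counter

from datasets import load_dataset  # pip install datasets

hle = load_dataset("cais/hle", split="test")

print(f"Total questions: {len(hle)}")   # roughly 2,500 per the paper
print(Counter(hle["answer_type"]))      # e.g. exact-match vs. multiple-choice

sample = hle[0]
print(sample["question"][:200])         # question text; multimodal items also include an image
```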

Methods and approach

HLE was developed through a global collaboration of subject-matter experts who contributed questions across multiple academic domains. The benchmark employs both multiple-choice and short-answer question formats, all designed to be suitable for automated grading while maintaining expert-level difficulty. Questions were constructed to have known, unambiguous solutions that can be verified easily but that require deep domain expertise rather than simple information retrieval. The Center for AI Safety and Scale AI jointly designed the dataset premise and collection pipeline, operated the data-collection platform, and provided funding and inference infrastructure for the LLMs. The HLE Contributors Consortium comprised the experts who submitted questions, contributed to dataset refinement, and assisted with evaluations. The benchmark was structured so that questions sit at the frontier of human knowledge while remaining closed-ended and objectively gradable.
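Because every item is closed-ended, grading can be automated. The snippet below is a minimal, hypothetical grader for the two formats (letter match for multiple choice, normalized exact match for short answers); it is not the authors' grading pipeline, and the `answer_type` labels are assumed for illustration.

```python
# Illustrative grader for closed-ended questions (not the authors' pipeline).
# Assumption: each item is labeled either "multipleChoice" or "exactMatch".
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace for a lenient match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def grade(prediction: str, reference: str, answer_type: str) -> bool:
    if answer_type == "multipleChoice":
        # Compare the chosen option letter (e.g. "B") case-insensitively.
        return prediction.strip().upper()[:1] == reference.strip().upper()[:1]
    # Short-answer items: normalized exact match against the reference answer.
    return normalize(prediction) == normalize(reference)


print(grade("B", "b", "multipleChoice"))   # True
print(grade(" 42. ", "42", "exactMatch"))  # True
```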

Results

State-of-the-art LLMs demonstrated low accuracy on HLE, revealing a substantial gap between current model capabilities and expert human performance on closed-ended academic questions. The models also exhibited poor calibration on the benchmark, indicating limitations not only in raw performance but also in confidence estimation. These results contrast sharply with the high accuracy rates achieved on existing popular benchmarks, confirming that HLE successfully establishes a more challenging evaluation standard. The low performance across multiple state-of-the-art systems suggests that current LLM architectures and training approaches have not yet reached expert-level competence on specialized academic knowledge, despite their strong performance on broader, less demanding assessments.
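Calibration here means how closely a model's stated confidence tracks its actual accuracy. One common way to quantify it is a binned root-mean-square calibration error, sketched below under the assumption that each answer comes with a confidence in [0, 1]; the binning scheme is illustrative and not necessarily the paper's exact protocol.

```python
# Sketch of binned RMS calibration error: group answers by stated confidence,
# then compare mean confidence with observed accuracy in each bin.
import numpy as np


def rms_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """confidences in [0, 1]; correct is a same-length array of 0/1 outcomes."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(confidences), 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        if i == n_bins - 1:
            # Final bin is closed on the right so that confidence 1.0 is counted.
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.sum() == 0:
            continue
        gap = confidences[mask].mean() - correct[mask].mean()
        err += (mask.sum() / total) * gap ** 2
    return float(np.sqrt(err))


# A model that is always 90% confident but only 30% correct is badly calibrated.
print(rms_calibration_error([0.9] * 10, [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]))  # ≈ 0.6
```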

Implications

The introduction of HLE addresses a critical gap in AI evaluation by providing a benchmark that can meaningfully differentiate between state-of-the-art systems and track progress toward expert-level capabilities. The marked performance gap identified between current LLMs and the expert human frontier on closed-ended academic questions has implications for both research direction and policy formulation. By establishing a higher difficulty threshold, HLE enables more informed measurement of genuine advances in LLM capabilities rather than incremental improvements on saturated benchmarks. The public release of the benchmark supports ongoing research efforts to develop more capable AI systems and provides policymakers with clearer metrics for understanding the current state and limitations of LLM technology. The benchmark's design principles—emphasizing verifiable solutions at the frontier of human knowledge—may inform future evaluation methodologies as AI capabilities continue to evolve.

Disclosure

  • Research title: A benchmark of expert-level academic questions to assess AI capabilities
  • Authors: Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Paul Vicol, Mantas Mazeika, Dan Hendrycks, Scale AI, Ziwen Han
  • Publication date: 2026-01-28
  • DOI: https://doi.org/10.1038/s41586-025-09962-4
  • Image credit: Photo by Yanhao Fang on Unsplash (Source · License)
  • Disclosure: This post was generated by artificial intelligence. The original authors did not write or review this post.