What the study found
The study found that current large language models (LLMs) do better with explicit time references, such as exact dates, than with implicit or vague ones, such as "yesterday" or "recently." Performance also declines as the number of past events in a set increases, and vague references are the hardest category overall.
Why the authors say this matters
The authors say TRAVELER fills a gap by providing a benchmark for evaluating whether LLMs can resolve explicit, implicit, and vague temporal references. The study suggests this benchmark can be used to test event-temporal reasoning in other models beyond the four evaluated here.
What the researchers tested
The researchers introduced TRAVELER, a synthetic question-answering benchmark for temporal reasoning over past events. It contains 3,300 English questions, generated from event sets of 5 to 100 household-domain events using four templates, and it includes ground-truth answers for vague references based on human surveys on Prolific.
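To make the template-based construction concrete, here is a minimal sketch of how questions with ground-truth answers could be generated from an event set. The event schema, template wording, and function names are illustrative assumptions, not the benchmark's actual format:

```python
from datetime import date

# Hypothetical sketch of TRAVELER-style question generation.
# The event format and template text below are assumptions for
# illustration; the benchmark's real schema may differ.
events = [
    ("water the plants", date(2024, 3, 10)),
    ("clean the kitchen", date(2024, 3, 12)),
    ("do the laundry", date(2024, 3, 14)),
]

def explicit_question(action: str) -> str:
    # Explicit reference: the answer is an exact date.
    return f"On which date did I last {action}?"

def make_pairs(event_set):
    # Pair each template-generated question with its ground-truth
    # answer derived directly from the event set.
    return [(explicit_question(a), d.isoformat()) for a, d in event_set]

pairs = make_pairs(events)
print(pairs[0])
```

For explicit references the ground truth follows mechanically from the event dates, as above; for vague references ("recently"), the study instead anchored answers in human surveys run on Prolific, since no single date is objectively correct.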
What worked and what didn't
All four state-of-the-art LLMs tested answered questions successfully when event sets contained only a handful of events and temporal references were explicit. Performance clearly deteriorated as event sets grew larger and as temporal references became less explicit, and the vague question category showed the lowest performance across all models.
What to keep in mind
The abstract describes a synthetic benchmark in a household domain, so the reported results apply to that setup. The summary does not provide additional limitations beyond the benchmark design and the fact that only four LLMs were tested.
Key points
- TRAVELER is a benchmark for evaluating temporal reasoning over explicit, implicit, and vague references.
- The benchmark includes 3,300 English questions generated from household-domain event sets with 5 to 100 events.
- Four state-of-the-art LLMs performed well on small event sets and explicit temporal references.
- Performance declined as event sets grew larger and as temporal references became less explicit.
- Vague temporal references produced the lowest performance across all tested models.
Disclosure
- Research title: TRAVELER benchmark reveals weaker LLM temporal reasoning with vague references
- Authors: Svenja Kenneweg, Jörg Deigmöller, Philipp Cimiano, Julian Eggert
- Institutions: Bielefeld University, Honda (Germany)
- Publication date: 2026-04-22