What the study found
The study found that current large language models (LLMs) do better with explicit time references, such as exact dates, than with implicit or vague ones, such as "yesterday" or "recently." Performance also declines as the number of past events in a set increases, and vague references are the hardest category overall.
Why the authors say this matters
The authors say TRAVELER fills a gap by providing a benchmark for evaluating whether LLMs can resolve explicit, implicit, and vague temporal references. The study suggests this benchmark can be used to test event-temporal reasoning in other models beyond the four evaluated here.
What the researchers tested
The researchers introduced TRAVELER, a synthetic question-answering benchmark for temporal reasoning over past events. It contains 3,300 English questions, generated from event sets of 5 to 100 household-domain events using four templates, and it includes ground-truth answers for vague references based on human surveys on Prolific.
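To make the template-based construction concrete, here is a minimal sketch of how questions with ground-truth answers could be generated from an event set. The event schema, template wording, and function names are illustrative assumptions, not the benchmark's actual format:

```python
from datetime import date

# Hypothetical sketch of TRAVELER-style question generation.
# The event format and template text below are assumptions for
# illustration; the benchmark's real schema may differ.
events = [
    ("water the plants", date(2024, 3, 10)),
    ("clean the kitchen", date(2024, 3, 12)),
    ("do the laundry", date(2024, 3, 14)),
]

def explicit_question(action: str) -> str:
    # Explicit reference: the answer is an exact date.
    return f"On which date did I last {action}?"

def make_pairs(event_set):
    # Pair each template-generated question with its ground-truth
    # answer derived directly from the event set.
    return [(explicit_question(a), d.isoformat()) for a, d in event_set]

pairs = make_pairs(events)
print(pairs[0])
```

For explicit references the ground truth follows mechanically from the event dates, as above; for vague references ("recently"), the study instead anchored answers in human surveys run on Prolific, since no single date is objectively correct.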
What worked and what didn't
All four state-of-the-art LLMs tested answered questions successfully when event sets contained only a handful of events and temporal references were explicit. Performance clearly deteriorated as event sets grew larger and as temporal references became less explicit, and the vague question category showed the lowest performance across all models.
What to keep in mind
The abstract describes a synthetic benchmark in a household domain, so the reported results apply to that setup. The summary does not provide additional limitations beyond the benchmark design and the fact that only four LLMs were tested.
Key points
- TRAVELER is a benchmark for evaluating temporal reasoning over explicit, implicit, and vague references.
- The benchmark includes 3,300 English questions generated from household-domain event sets with 5 to 100 events.
- Four state-of-the-art LLMs performed well on small event sets and explicit temporal references.
- Performance declined as event sets grew larger and as temporal references became less explicit.
- Vague temporal references produced the lowest performance across all tested models.
Disclosure
- Research title: TRAVELER benchmark reveals weaker LLM temporal reasoning with vague references
- Authors: Svenja Kenneweg, Jörg Deigmöller, Philipp Cimiano, Julian Eggert
- Institutions: Bielefeld University, Honda (Germany)
- Publication date: 2026-04-22