Publishing process signals: MODERATE (reflects the venue and review process).

TRAVELER benchmark reveals weaker LLM temporal reasoning with vague references

Research area: Natural language processing, Artificial intelligence, Benchmark (surveying)

What the study found

The study found that current large language models (LLMs) do better with explicit time references, such as exact dates, than with implicit or vague ones, such as "yesterday" or "recently." Performance also declines as the number of past events in a set increases, and vague references are the hardest category overall.

Why the authors say this matters

The authors say TRAVELER fills a gap by providing a benchmark for evaluating whether LLMs can resolve explicit, implicit, and vague temporal references. The study suggests this benchmark can be used to test event-temporal reasoning in other models beyond the four evaluated here.

What the researchers tested

The researchers introduced TRAVELER, a synthetic question-answering benchmark for temporal reasoning over past events. It contains 3,300 English questions, generated from event sets of 5 to 100 household-domain events using four templates, and it includes ground-truth answers for vague references based on human surveys on Prolific.
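To make the setup concrete, here is a minimal sketch of how a template-based question over a synthetic event set could be generated. It is not the authors' generator: the activity list, the two question templates, and the names Event, make_event_set, explicit_question, implicit_question, and answer are all assumptions chosen for illustration. Vague references such as "recently" are left out because, in TRAVELER, their ground truth comes from human surveys rather than from a rule.

```python
from dataclasses import dataclass
from datetime import date, timedelta
import random


@dataclass(frozen=True)
class Event:
    description: str
    day: date


# Hypothetical household activities; the benchmark's real vocabulary is not given here.
ACTIVITIES = [
    "watered the plants",
    "washed the dishes",
    "vacuumed the living room",
    "took out the trash",
    "did the laundry",
]


def make_event_set(size: int, today: date, rng: random.Random) -> list[Event]:
    """Sample `size` past events, each on a distinct day before `today`."""
    offsets = rng.sample(range(1, size + 30), size)  # distinct day offsets
    return [Event(rng.choice(ACTIVITIES), today - timedelta(days=o)) for o in offsets]


def explicit_question(target: Event) -> str:
    """Assumed template: refer to the event by its exact date."""
    return f"What happened on {target.day.isoformat()}?"


def implicit_question(target: Event, today: date) -> str:
    """Assumed template: refer to the event relative to the current day."""
    days_ago = (today - target.day).days
    reference = "yesterday" if days_ago == 1 else f"{days_ago} days ago"
    return f"What happened {reference}?"


def answer(events: list[Event], day: date) -> list[str]:
    """Ground truth: descriptions of all events that fall on `day`."""
    return sorted(e.description for e in events if e.day == day)


if __name__ == "__main__":
    today = date(2024, 5, 20)  # arbitrary reference date for the example
    events = make_event_set(5, today, random.Random(0))
    target = events[0]
    print(explicit_question(target))
    print(implicit_question(target, today))
    print(answer(events, target.day))
```

Both questions share the same ground-truth answer; only the form of the temporal reference changes. Under these assumptions, that is what would let a benchmark compare reference types while holding the event set fixed.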

What worked and what didn't

All four state-of-the-art LLMs tested successfully answered questions that used explicit temporal references over event sets containing only a handful of events. Performance clearly deteriorated as event sets grew larger and as temporal references became less explicit, and the vague question category showed the lowest performance across all models.

What to keep in mind

TRAVELER is a synthetic benchmark restricted to a household domain, so the reported results apply to that setup and may not carry over to other domains. Only four LLMs were tested, and the abstract this summary is based on does not state any further limitations.

Key points

TRAVELER is a benchmark for evaluating temporal reasoning over explicit, implicit, and vague references.
The benchmark includes 3,300 English questions generated from household-domain event sets with 5 to 100 events.
Four state-of-the-art LLMs performed well on small event sets and explicit temporal references.
Performance declined as event sets grew larger and as temporal references became less explicit.
Vague temporal references produced the lowest performance across all tested models.

Disclosure

Research title: TRAVELER benchmark reveals weaker LLM temporal reasoning with vague references
Image credit: Photo by Mikhail Nilov on Pexels

AI provenance: AI provenance information is not available for this post.