LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
Authors
Sumeet Ramesh Motwani | Daniel Nichols | Charles London | Peggy Li | Fabio Pizzati | Acer Blake | Hasan Hammoud | Tavish McDonald | Akshat Naik | Alesia Ivanova | Vignesh Baskaran | Ivan Laptev | Ruben Glatt | Tal Ben-Nun | Philip Torr | Natasha Jaques | Ameya Prabhu | Brian Bartoldson | Bhavya Kailkhura | Christian Schroeder de Witt
Journal
No journal information available
Year
2026
Category
Country
-
📝 Abstract
As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic, built to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
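The abstract describes each problem as a short input paired with a graph of interdependent, individually tractable steps and an exactly verifiable final answer. The sketch below is a minimal Python illustration of that structure, assuming nothing about the benchmark's actual released schema; the names LongCoTProblem, Step, gold_answer, and verify are hypothetical and chosen only for this example.

```python
# Hypothetical sketch, not the authors' released format: one way to model a
# LongCoT-style problem as a short input, a DAG of interdependent reasoning
# steps, and an exactly verifiable final answer.
from dataclasses import dataclass, field


@dataclass
class Step:
    step_id: str
    # Edges in the reasoning graph: steps that must be resolved first.
    depends_on: list[str] = field(default_factory=list)


@dataclass
class LongCoTProblem:
    prompt: str             # short input
    steps: dict[str, Step]  # interdependent steps, each locally tractable
    gold_answer: str        # verifiable answer

    def topological_order(self) -> list[str]:
        """Order steps so every dependency precedes its dependents.

        Assumes the step graph is acyclic, as a chain-of-thought plan must be.
        """
        order: list[str] = []
        seen: set[str] = set()

        def visit(sid: str) -> None:
            if sid in seen:
                return
            seen.add(sid)
            for dep in self.steps[sid].depends_on:
                visit(dep)
            order.append(sid)

        for sid in self.steps:
            visit(sid)
        return order


def verify(problem: LongCoTProblem, model_answer: str) -> bool:
    """Exact-match check against the verifiable gold answer."""
    return model_answer.strip() == problem.gold_answer.strip()


# Toy usage: a two-step chain where step "b" depends on step "a".
problem = LongCoTProblem(
    prompt="Apply rule R twice to the start state.",
    steps={"a": Step("a"), "b": Step("b", depends_on=["a"])},
    gold_answer="42",
)
assert problem.topological_order() == ["a", "b"]
print(verify(problem, "42"))  # True
```

Under this reading, scoring a model reduces to exact-match verification of the final answer, which is what makes failures attributable to long-horizon reasoning rather than to grading ambiguity.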
📊 Article Statistics
Basic Stats
Views: 84
Downloads: 0
Citations: 4
Citation Trend
Country Distribution
Institution Distribution
Monthly Views