If models can think for 100,000 tokens, why do they still lose the plot?
Come join us for this AI4Science on alphaXiv talk: Long-Horizon Reasoning in LLMs.
In this session, Sumeet Motwani (@sumeetrm) and Charles London (@CharlieLondon02) will share recent work on both training and evaluating models that can reason over much longer chains of thought.
Their LongCoT benchmark tests whether models can handle long chains of dependent reasoning across different fields. Each step is solvable on its own, but the full problem requires planning, state tracking, backtracking, and avoiding compounding errors. Even the best models still score below 10%.
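To see why compounding errors matter so much on long dependent chains, here is a rough back-of-the-envelope sketch. It assumes every step fails independently with the same probability, which is a simplification of ours and not a claim from the LongCoT work; the function name and the 99% figure are purely illustrative.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Chance that every step in a dependent chain is correct.

    Assumption (ours, for illustration): step errors are independent and
    each step has the same accuracy, so the probabilities multiply.
    """
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    # Hypothetical per-step accuracy of 99%, chains of growing length.
    for num_steps in (1, 10, 100, 1000):
        p = chain_success_probability(0.99, num_steps)
        print(f"{num_steps:>5} steps -> {p:.4f}")
```

Under these assumptions, a model that gets each step right 99% of the time finishes a 100-step chain correctly only about 37% of the time, which is why being able to solve each step on its own does not guarantee solving the full problem.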
They will also discuss h1, which trains long-horizon reasoning by chaining short problems into longer dependency graphs, then using RL with outcome-only rewards and a gradually harder curriculum.
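The idea of chaining short problems and rewarding only the final outcome can be illustrated with a toy sketch. This is not the h1 implementation: the arithmetic tasks, the linear chain (instead of a general dependency graph), and every name below (`Step`, `build_chain`, `outcome_reward`, `curriculum`) are hypothetical stand-ins chosen to keep the example small.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    op: str        # "+" or "*"
    operand: int   # applied to the result of the previous step


def build_chain(length: int, seed: int = 0) -> tuple[int, list[Step]]:
    """Chain `length` short arithmetic problems; each depends on the last."""
    rng = random.Random(seed)
    start = rng.randint(1, 9)
    steps = [Step(rng.choice("+*"), rng.randint(2, 9)) for _ in range(length)]
    return start, steps


def solve_chain(start: int, steps: list[Step]) -> int:
    """Reference answer, computed by carrying the state through every step."""
    value = start
    for step in steps:
        value = value + step.operand if step.op == "+" else value * step.operand
    return value


def outcome_reward(predicted_final: int, start: int, steps: list[Step]) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0.

    Intermediate steps are never scored.
    """
    return 1.0 if predicted_final == solve_chain(start, steps) else 0.0


def curriculum(max_length: int) -> list[int]:
    """Gradually harder schedule: chain lengths 1, 2, ..., max_length."""
    return list(range(1, max_length + 1))


if __name__ == "__main__":
    for length in curriculum(4):
        start, steps = build_chain(length, seed=length)
        answer = solve_chain(start, steps)
        print(length, outcome_reward(answer, start, steps))
```

In this toy setup, a single wrong intermediate value changes the final answer and earns zero reward, so longer chains are strictly harder even though each step is trivial on its own.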
So if longer context windows are not enough, what does it actually take to make models reason reliably over long scientific and technical workflows?
Whether you’re working on frontier LLMs, AI4Science, reasoning, or just curious about what current models still cannot do, you should definitely check this talk out!
🗓 Friday May 15th 2026 · 11 AM PT
🎙 Featuring Sumeet Motwani and Charles London
💬 Casual Talk · Open Discussion