Hard tasks expose where LLM agents still break 🧱
In our LLM evals, several reasoning-heavy tasks sit at zero or near-zero success:
GraphColoring, LightsOut, SwitchCircuit, ProgramSynthesis, and SymbolMatching.
Sequential search and state tracking are still far from solved.
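
To make "sequential search with state tracking" concrete, here is a minimal sketch for a LightsOut-style puzzle. It assumes the classic rules (pressing a cell toggles it and its orthogonal neighbours, and the goal is all lights off); the eval's actual task definition may differ, and `press` and `solve_lights_out` are hypothetical names used only for illustration. A plain breadth-first search solves small boards because it records every board state it has already visited, which is the bookkeeping an agent has to carry across steps.

```python
from collections import deque


def press(state: int, i: int, n: int) -> int:
    """Toggle cell i and its orthogonal neighbours on an n x n board.

    The board is a bitmask: bit (row * n + col) is 1 when that light is on.
    Classic Lights Out rules are an assumption, not the eval's spec.
    """
    r, c = divmod(i, n)
    for rr, cc in ((r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= rr < n and 0 <= cc < n:
            state ^= 1 << (rr * n + cc)
    return state


def solve_lights_out(start: int, n: int):
    """Breadth-first search for a shortest press sequence that clears the board.

    Returns a list of cell indices, or None if no sequence exists.
    """
    seen = {start}                 # every board state visited so far
    queue = deque([(start, [])])   # (state, presses taken to reach it)
    while queue:
        state, path = queue.popleft()
        if state == 0:             # all lights off
            return path
        for i in range(n * n):
            nxt = press(state, i, n)
            if nxt not in seen:    # state tracking: never revisit a board
                seen.add(nxt)
                queue.append((nxt, path + [i]))
    return None


if __name__ == "__main__":
    # Build a solvable 3x3 board by pressing two corners of an empty board.
    board = press(press(0, 0, 3), 8, 3)
    print(solve_lights_out(board, 3))
```

Exhaustive search like this only scales to small boards, but it shows the core requirement: each step depends on an exact record of the current state and of the states already explored.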