I was very disappointed by it. They trained a small model on toy problems and complained that it could not generalize outside the problem domain?
Like, if it's only trained on F1(A) = N and F1(M) = Z, how would we expect it to know what F1(O) is?
More bad news for reasoning LLMs.
The paper claims that Chain-of-Thought reasoning in language models is a brittle mirage bounded by training data: pattern matching rather than genuine inference.
It argues that chain of thought in LLMs is pattern replay bound to training data, not general reasoning.
Chain of thought is useful when prompts match training patterns, but it is not evidence of general reasoning.
When test data shifts even slightly, the step-by-step text stays fluent but the logic cracks.
The authors build DataAlchemy, a controlled sandbox, and train small GPT-2 style models on alphabet puzzles built from 2 operations, rotating letters and shifting positions, over 4-letter strings.
This lets them probe 3 axes: task, length, and format.
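To make the two operations concrete, here is a minimal Python sketch. The function names, the default rotation of 13, and the left-shift direction are assumptions for illustration, not the paper's actual code. `compose` records every intermediate string, which is roughly what a step-by-step chain over these puzzles would look like.

```python
import string

ALPHABET = string.ascii_uppercase  # assumes uppercase A-Z inputs only


def rotate_letters(s: str, n: int = 13) -> str:
    """Replace each letter with the one n places later, wrapping past Z."""
    return "".join(ALPHABET[(ALPHABET.index(c) + n) % 26] for c in s)


def shift_positions(s: str, k: int = 1) -> str:
    """Cyclically move the string k positions to the left."""
    if not s:
        return s
    k %= len(s)
    return s[k:] + s[:k]


def compose(*operations):
    """Chain operations and return every intermediate string, input first."""
    def run(s: str) -> list:
        steps = [s]
        for operation in operations:
            steps.append(operation(steps[-1]))
        return steps
    return run


# Hypothetical "seen" composition: rotate first, then shift.
chain = compose(rotate_letters, shift_positions)
print(chain("ABCD"))
```

Swapping the order of the two functions in `compose` gives a composition built from the same parts but never seen in that order, which is the kind of task shift the thread describes.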
When the test uses the same transformation pattern as training, the model's full chain output matches the label 100%. The moment the test swaps in new compositions of those operations or a truly unseen transform, that exact match collapses to 0.01% or 0%, even though the model still writes confident step-by-step text. That text looks reasonable, but the final answer is wrong.
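Full-chain exact match is a strict metric, which is part of why the collapse looks so sharp. A rough sketch of how such a score could be computed, assuming each chain is a list of strings (the paper's exact scoring code may differ):

```python
def exact_match(predicted_chains, reference_chains):
    """Fraction of examples whose whole chain equals the reference chain.

    A chain counts as correct only if every intermediate step and the
    final answer match; there is no partial credit.
    """
    if len(predicted_chains) != len(reference_chains):
        raise ValueError("need exactly one prediction per reference")
    if not reference_chains:
        return 0.0
    hits = sum(
        list(predicted) == list(reference)
        for predicted, reference in zip(predicted_chains, reference_chains)
    )
    return hits / len(reference_chains)


# Illustrative data only, not results from the paper.
references = [["ABCD", "NOPQ", "OPQN"], ["WXYZ", "JKLM", "KLMJ"]]
predictions = [["ABCD", "NOPQ", "OPQN"], ["WXYZ", "JKLM", "JKLM"]]
print(exact_match(predictions, references))  # 0.5: second chain has a wrong final answer
```

Under this metric a chain with fluent, plausible intermediate steps and one wrong final string scores the same as complete gibberish.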
Element shift is similar: novel letter combinations or unseen letters break the chain completely.
Length shift hurts too: models trained on length 4 fail on lengths 3 and 5, and even pad or trim steps to mimic the length they have seen. A group padding trick helps a bit.
Format noise degrades outputs: insertions hurt more than deletions, and edits to element or transform tokens matter far more than changes to filler prompt words.
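For a sense of what insertion and deletion noise could look like on a tokenized prompt, here is a small sketch. The token format, the `<noise>` placeholder, and the per-token probability `p` are all assumptions for illustration; the paper's perturbation procedure may be defined differently.

```python
import random


def delete_tokens(tokens, p, rng):
    """Drop each token independently with probability p."""
    return [token for token in tokens if rng.random() >= p]


def insert_tokens(tokens, p, rng, noise="<noise>"):
    """Insert a noise token before each token with probability p."""
    noisy = []
    for token in tokens:
        if rng.random() < p:
            noisy.append(noise)
        noisy.append(token)
    return noisy


rng = random.Random(0)  # fixed seed so a run is repeatable
prompt = ["A", "B", "C", "D", "[rot]", "<answer>"]  # hypothetical prompt tokens
print(delete_tokens(prompt, 0.3, rng))
print(insert_tokens(prompt, 0.3, rng))
```

Restricting the perturbation to only the letter tokens, only the transform token, or only the filler tokens would let you compare which part of the prompt is most sensitive.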
A tiny burst of supervised fine tuning, about 0.00015 of the data, quickly patches accuracy, which signals distribution coverage, not new reasoning skills.
Temperature and size, 68K to 543M parameters, barely change the pattern.
Bottom line, when the data moves, accuracy can collapse while the story still sounds fine. Test on real shifts, not just matching cases, and keep training coverage honest.