13/ Overall, we found no evidence of formal reasoning in language models, including open-source models like #Llama, #Phi, #Gemma, and #Mistral, and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated pattern matching, which is so fragile that changing names can alter results by ~10%! We can scale data, parameters, and compute, or use better training data for Phi-4, Llama-4, GPT-5. But we believe this will result in 'better pattern-matchers,' not necessarily 'better reasoners.'
Check out the full paper to find out more:
arxiv.org/pdf/2410.05229
Also stay tuned for the data release!