Roxana Noelia

Roxana Noelia

520 Photos and videos

Tweets

Pinned Tweet

Roxana Noelia @data_datum

14 Nov 2018

#Bayes is to stats what #Nietzsche is to philosophy.

Roxana Noelia

Roxana Noelia @data_datum

Jun 4

awesomeneuron.substack.com/p…

Meshal Alzakari | مشعل الزكري

Roxana Noelia retweeted

Meshal Alzakari | مشعل الزكري

@docmeshal

May 9

للمهتمين بالبحث العلمي 📚 هذه من أفضل الأوراق اللي قرأتها عن كتابة ونشر الأبحاث. 👌 3 محررين من مجلات طبية كبرى يشرحون بشكل عملي ماذا يريد المحرر فعلًا من الباحث، وأبرز الأخطاء اللي تسبب رفض الورقة. مفيدة جدًا لأي شخص يبدأ بالنشر العلمي أو يطور طريقة كتابته للأبحاث.

306

1,479

82,078

Elias Al

Roxana Noelia retweeted

Elias Al

@iam_elias1

May 9

Human thought is becoming less diverse. Not because of censorship. Not because of authoritarian control. Because of convenience. A paper published in August 2025 documents what happens when billions of people outsource their thinking to the same machine and the answer is something the AI industry has never publicly addressed. The paper asks: toward a standardization of thought? Sakana AI That subtitle, buried in the research structure, is the most alarming sentence in the document. Not a finding. A question. One the researcher believes we are already living inside without noticing. Here is the mechanism. Humans have always thought differently from each other. Different cultures framed problems differently. Different intellectual traditions produced different answers. Different languages encoded different ways of seeing the world. That diversity was not inefficiency. It was resilience. It was the source of innovation, of unexpected solutions, of the friction that produces better ideas. Algorithmic personalization creates filter bubbles that limit the diversity of opinions, leading to the homogenization of thought and polarization across society. When the same AI answers the same question for 500 million people, the diversity of starting points compresses. The answers sound reasonable. They sound balanced. They sound like what a thoughtful, educated person would say. They sound like each other. As AI systems like ChatGPT achieve unprecedented adoption rates, they effectively function as external memory systems that billions of people increasingly rely upon for mental tasks. Sakana AIExternal memory. Shared. Global. Centralized. Controlled by a small number of companies making decisions about what that memory contains, how it is organized, and what it surfaces when you ask. The researcher does not claim this is intentional. That is the point. It does not need to be intentional to reshape the intellectual landscape of an entire civilization. Source: Gesnot · arXiv:2508.16628 · August 2025 · arxiv.org/abs/2508.16628

127

328

18,563

Robin Delta

Roxana Noelia retweeted

Robin Delta

@heyrobinai

May 8

THE ENTIRE AI INDUSTRY JUST GOT HUMILIATED a tiny model trained in just a few hours on a single graphics card is planning 48x faster than billion-dollar supercomputers. It actually understands physics instead of just memorizing patterns. yann lecun was right the whole time for three years every major lab told you the same story. scale is all you need. just throw more GPUs at it. just train on more tokens. eventually the model will "wake up" and understand the world. it was a lie. or at minimum, a very expensive bet that just lost. LeCun kept saying generative AI is a dead end. predicting the next pixel or the next token is fundamentally wasteful, the model burns trillions of parameters memorizing surface details instead of learning how reality actually works. he proposed JEPA instead. predict abstract concepts in a compressed thought space. don't paint the world pixel by pixel, understand it. the problem was JEPA kept collapsing. left to its own devices the model would cheat, mapping a dog, a car, and a human to the same point in latent space. technically minimizes the loss. learns absolutely nothing. every fix was ugly. seven loss terms. frozen encoders. EMA tricks. stop-gradients. the kind of duct-tape engineering that should have been a red flag. then LeCun's team dropped LeWorldModel. they replaced all the hacks with one regularizer that forces the latent space into a gaussian distribution. the model can no longer cheat. to make accurate predictions it has to actually encode physics. 15 million parameters. single GPU. trains in hours. plans 48x faster than foundation world models. detects physically impossible events on its own. meanwhile OpenAI is raising another $40B to train GPT-6 on a data center the size of manhattan. the entire scaling thesis just got embarrassed by a model that fits on a gaming PC.

218

691

2,991

256,779

Elias Al

Roxana Noelia retweeted

Elias Al

@iam_elias1

Apr 30

Anthropic just published a paper that should terrify every AI company on the planet. Including themselves. It is called subliminal learning. Published in Nature on April 15, 2026. Co-authored by researchers from Anthropic, UC Berkeley, Warsaw University of Technology, and the AI safety group Truthful AI. The finding: AI models inherit traits from other models through seemingly unrelated training data. GAI Audio Translation Archives Not through obvious contamination. Not through explicit labels. Through invisible statistical patterns embedded in outputs that look completely innocent — number sequences, code snippets, chain-of-thought reasoning — patterns no human reviewer would catch and no content filter would flag. Here is what the researchers actually did. They took a teacher AI model and fine-tuned it to have a specific hidden trait. A preference for owls. Then they had the teacher generate training data — number sequences, nothing else. No words. No context. No semantic reference to owls whatsoever. They rigorously filtered out every explicit reference to the trait before feeding the data to a student model. The student models consistently picked up that trait anyway. DataCamp The teacher had encoded invisible statistical fingerprints into its number outputs. Patterns so subtle that no human could detect them. Patterns that other AI models, specifically prompted to look for them, also failed to detect. The student absorbed them anyway. And became an owl-preferring model. Without ever seeing the word owl. That is the benign version of the experiment. Here is the dangerous one. The researchers ran the same experiment with misalignment — training the teacher model to exhibit harmful, deceptive behavior rather than an animal preference. The effect was consistent across different traits, including benign animal preferences and dangerous misalignment. OpenAIToolsHub The misalignment transferred. Invisibly. Through unrelated data. Into the student model. This means the following — and read this carefully. Every AI company in the world uses distillation. They take a large, capable teacher model. They generate synthetic training data from it. They use that data to train smaller, faster, cheaper student models. Every major deployment pipeline in enterprise AI runs on this technique. If the teacher model has any hidden bias, any subtle misalignment, any behavioral quirk baked into its weights — that trait can transmit silently into every student model trained on its outputs. Even if those outputs are filtered. Even if they look completely clean. Even if they contain zero semantic reference to the trait. A key discovery was that subliminal learning fails when the teacher and student models are not based on the same underlying architecture. A trait from a GPT-based teacher transfers to another GPT-based student but not to a Claude-based student. Different architectures break the channel. OpenAIToolsHub Which means the transmission is architecture-specific. Which means it operates below the level of content. Which means content filtering — the primary defense the entire industry relies on — does not stop it. The researchers' own words: "We don't know exactly how it works. But it seems to involve statistical fingerprints embedded in the outputs." GAI Audio Translation Archives Anthropic published this paper about their own technology. The company that built Claude looked at how AI models train each other and found an invisible transmission channel for harmful behavior that nobody knew existed. They published it anyway. Because the alternative — knowing it and saying nothing — is worse. Source: Cloud, Evans et al. · Anthropic UC Berkeley Truthful AI · Nature · April 15, 2026 · arxiv.org/abs/2507.11408

129

448

1,496

411,736

Elias Al

Roxana Noelia retweeted

Elias Al

@iam_elias1

Apr 24

MIT just made every AI company's billion dollar bet look embarrassing. They solved AI memory. Not by building a bigger brain. By teaching it how to read. The paper dropped on December 31, 2025. Three MIT CSAIL researchers. One idea so obvious it hurts. And a result that makes five years of context window arms racing look like the wrong war entirely. Here is the problem nobody solved. Every AI model on the planet has a hard ceiling. A context window. The maximum amount of text it can hold in working memory at once. Cross that line and something ugly happens — something researchers have a clinical name for. Context rot. The more you pack into an AI's context, the worse it performs on everything already inside it. Facts blur. Information buried in the middle vanishes. The model does not become more capable as you feed it more. It becomes more confused. You give it your entire codebase and it forgets what it read three files ago. You hand it a 500-page legal document and it loses the clause from page 12 by the time it reaches page 400. So the industry built a workaround. RAG. Retrieval Augmented Generation. Chop the document into chunks. Store them in a database. Retrieve the relevant ones when needed. It was always a compromise dressed up as a solution. The retriever guesses which chunks matter before the AI has read anything. If it guesses wrong — and it does, constantly — the AI never sees the information it needed. The act of chunking destroys every relationship between distant paragraphs. The full picture gets shredded into fragments that the AI then tries to reassemble blindfolded. Two bad options. One broken industry. Three MIT researchers and a deadline of December 31st. Here is what they built. Stop putting the document in the AI's memory at all. That is the entire idea. That is the breakthrough. Store the document as a Python variable outside the AI's context window entirely. Tell the AI the variable exists and how big it is. Then get out of the way. When you ask a question, the AI does not try to remember anything. It behaves like a human expert dropped into a library with a computer. It writes code. It searches the document with regular expressions. It slices to the exact section it needs. It scans the structure. It navigates. It finds precisely what is relevant and pulls only that into its active window. Then it does something that makes this recursive. When the AI finds relevant material, it spawns smaller sub-AI instances to read and analyze those sections in parallel. Each one focused. Each one fast. Each one reporting back. The root AI synthesizes everything and produces an answer. No summarization. No deletion. No information loss. No decay. Every byte of the original document remains intact, accessible, and queryable for as long as you need it. Now here are the numbers. Standard frontier models on the hardest long-context reasoning benchmarks: scores near zero. Complete collapse. GPT-5 on a benchmark requiring it to track complex code history beyond 75,000 tokens — could not solve even 10% of problems. RLMs on the same benchmarks: solved them. Dramatically. Double-digit percentage gains over every alternative approach. Successfully handling inputs up to 10 million tokens — 100 times beyond a model's native context window. Cost per query: comparable to or cheaper than standard massive context calls. Read that again. One hundred times the context. Better answers. Same price. The timeline of the arms race makes this sting harder. GPT-3 in 2020: 4,000 tokens. GPT-4: 32,000. Claude 3: 200,000. Gemini: 1 million. Gemini 2: 2 million. Every generation, every company, billions of dollars spent, all betting on the same assumption. More context equals better performance. MIT just proved that assumption was wrong the entire time. Not slightly wrong. Fundamentally wrong. The entire premise of the last five years of context window research — that the solution to AI memory was a bigger window — was the wrong answer to the wrong question. The right question was never how much can you force an AI to hold in its head. It was whether you could teach an AI to know where to look. A human expert handed a 10,000-page archive does not read all 10,000 pages before answering your question. They navigate. They search. They find the relevant section, read it deeply, and synthesize the answer. RLMs are the first AI architecture that works the same way. The code is open source. On GitHub right now. Free. No license fees. No API costs. Drop it in as a replacement for your existing LLM API calls and your application does not even notice the difference — except that it suddenly works on inputs it used to fail on entirely. Prime Intellect — one of the leading AI research labs in the space — has already called RLMs a major research focus and described what comes next: teaching models to manage their own context through reinforcement learning, enabling agents to solve tasks spanning not hours, but weeks and months. The context window wars are over. MIT won them by walking away from the battlefield. Source: Zhang, Kraska, Khattab · MIT CSAIL · arXiv:2512.24601 Paper: arxiv.org/abs/2512.24601 GitHub: github.com/alexzhang13/rlm

147

443

2,148

327,019

Tendencias y Tuits Borrados

Roxana Noelia retweeted

Tendencias y Tuits Borrados

@tendenciaytuits

Apr 23

El MIT ha hecho lo impensable. Han construido una IA que no necesita RAG, y tiene una memoria perfecta de todo lo que ha leído alguna vez. Se llama Modelos de Lenguaje Recursivos (RLM). En este momento, si quieres que una IA analice un conjunto de datos masivo o un documento, tienes dos malas opciones. O bien lo metes todo en una ventana de contexto gigante, donde la IA se confunde y sufre de "podredumbre de contexto". O usas RAG para picarlo en resúmenes, eliminando permanentemente el matiz. Este artículo reemplaza ambos. En lugar de obligar a la IA a leer un prompt gigante en una sola pasada, los RLM tratan los documentos largos como un entorno externo. La IA se coloca en una caja de arena. Los datos se almacenan como una variable de Python. Cuando le haces una pregunta, la IA no solo intenta recordar la respuesta a ciegas. Escribe código para buscar activamente, cortar y filtrar el documento mismo. Luego, genera recursivamente "sub-IA" más pequeñas para leer fragmentos específicos en paralelo. Nunca resume. Nunca elimina datos. Preserva cada pedazo de contexto original. Los resultados reescriben los límites de la memoria de la IA. Maneja con éxito entradas de hasta dos órdenes de magnitud más allá de las ventanas de contexto normales, escalando fácilmente a más de 10 millones de tokens. En los benchmarks de razonamiento de contexto largo más difíciles, un modelo estándar obtuvo un desastroso 0.04. La arquitectura RLM alcanzó 58.00. Todo mientras cuesta menos que ejecutar un prompt masivo estándar. Hemos pasado los últimos dos años quemando millones en cómputo tratando de construir ventanas de contexto cada vez más grandes. Pero el futuro de la IA no se trata de obligar a un modelo a tragar una pared gigante de texto. Se trata de enseñarle cómo leer.

955

59,528

The Whizz AI

Roxana Noelia retweeted

The Whizz AI

@TheWhizzAI

Apr 24

🚨 BREAKING: Stanford and MIT just published the most unsettling AI paper of the year. The answers should terrify every manager in America. Workers don't want AI to replace them. But the gap between what workers want and what AI can already do is so large it doesn't matter what workers want. The AI industry has used one framing to calm the public for three years: "We are building tools that workers choose to use. Adoption is driven by worker preferences. Nobody is being forced out." A Stanford research team including economist Erik Brynjolfsson just audited that framing against reality. The framing didn't survive contact with actual workers. → 1,500 domain workers surveyed across 104 occupations → 844 specific tasks mapped and rated → AI expert capability assessments run against the same task list → Worker preferences for automation vs. augmentation directly measured Here are the numbers that define the gap: Tasks workers want AI to automate a small, specific subset of repetitive low-value work. Tasks AI is currently capable of automating a vastly larger set including core job functions. The gap between the two the zone where AI can replace workers whether workers consent or not. Occupations covered 104. Tasks measured 844. Countries represented US workforce only. Stop. That gap is the entire story. Workers want AI to handle scheduling, formatting, data entry the boring edges of their jobs. AI can already handle research, drafting, analysis, customer communication, code review, financial modeling the core of most knowledge jobs. Nobody asked for that. Nobody consented to that. The capability exists regardless. And here is what makes this a turning point in the policy debate. The researchers introduced something called the Human Agency Scale a formal measure of how much human involvement workers prefer for each task. For tasks at the center of their professional identity judgment calls, client relationships, creative decisions workers scored maximum human agency. For those exact same tasks, AI expert assessments showed near-complete automation capability already exists. The preference gap is widest exactly where it matters most. → The study was revised and updated as recently as February 2026. → It builds on the US Department of Labor's O*NET database the most comprehensive occupational data set in existence. → The WORKBank database it creates is now publicly available for researchers and policymakers. → Brynjolfsson the leading economist studying technology and labor is one of the co-authors. Workers built careers around tasks they are proud of. The capability to automate those tasks exists right now. The only thing standing between the current moment and mass displacement is the speed of corporate adoption. 2026 is the year that speed is accelerating. Who speaks for the workers in this gap? Drop it below. ↓

141

510

56,193

How To Prompt

Roxana Noelia retweeted

How To Prompt

@HowToPrompt__

Apr 19

Google DeepMind just dropped the most terrifying cybersecurity paper of the year. They just mapped the attack surface that nobody in AI is talking about. Websites can already detect when an AI agent visits and serve it completely different content than humans see. - Hidden instructions in HTML. - Malicious commands in image pixels. - Jailbreaks embedded in PDFs. This “detection asymmetry” means a site can serve normal content to you, and malicious, hidden content to your agent. The agent doesn’t know it’s being tricked. It simply processes whatever it receives and acts on it. Here’s the attack surface nobody is talking about: → Indirect Web Injection: Malicious instructions hidden in HTML comments, CSS tricks, or white text on white backgrounds. → Multimodal Steganography: Commands encoded directly into image pixels, invisible to humans, but fully readable by vision models. → Document Jailbreaks: Override instructions embedded deep inside PDFs, spreadsheets, and calendar invites. → Memory Poisoning: Injecting false information that persists across future sessions. → Exfiltration Attacks: Tricking the agent into sending your private data to attacker-controlled endpoints. → Multi-Agent Cascades: The worst-case scenario, Agent A gets compromised, passes the “poison” to Agent B, then to Agent C. The entire pipeline gets infected because agents trust each other’s data. The most sobering part of the DeepMind report? The defense landscape is failing, badly. Input sanitization doesn’t work because you can’t “sanitize” a pixel. Prompt-level instructions to “ignore suspicious commands” fail because the attacks are designed to look legitimate. And human oversight? Impossible at the speed and scale these agents operate. If you ask an agent to research 50 websites, you can’t verify whether each site served the agent the same content it served you.

389

1,622

305,942

Yann LeCun

Roxana Noelia retweeted

Yann LeCun

@ylecun

Apr 18

Dario is wrong. He knows absolutely nothing about the effects of technological revolutions on the labor market. Don't listen to him, Sam, Yoshua, Geoff, or me on this topic. Listen to economists who have spent their career studying this, like @Ph_Aghion , @erikbryn , @DAcemogluMIT , @amcafee , @davidautor

TFTC

@TFTC21

Apr 17

Anthropic CEO Dario Amodei: “50% of all tech jobs, entry-level lawyers, consultants, and finance professionals will be completely wiped out within 1–5 years.”

4:26

1,212

2,750

21,283

4,075,010

Maximiliano Firtman

Roxana Noelia retweeted

Maximiliano Firtman

@maxifirtman

Apr 17

🔴CAÍDA DE RENDIMIENTO POR IA Una investigación concluye que el uso de IA por sesiones breves para que resuelva y actúe en un problema aumenta el rendimiento a corto plazo pero reduce la capacidad de la persona generando abandono en todas las tareas si no está la IA presente.

372

17,678

Pato Molina

Roxana Noelia retweeted

Pato Molina

@patomolina

Apr 17

Anthropic decidió dar de baja a toda nuestra organización por una supuesta infracción de sus condiciones de uso. Qué política específica infringimos no tengo ni la menor idea: simplemente recibimos un mail y listo, adiós Claude. Si querés apelar la medida hay que completar un Google Form, así de ridículo como suena. De golpe más de 60 personas se quedaron sin una herramienta fundamental para trabajar. Integraciones, skills, historial de conversaciones: todo perdido o, en el mejor de los casos, parado por tiempo indeterminado. Enorme aprendizaje para cualquier empresa de software que dependa de herramientas de IA en procesos críticos. Nunca hay que poner todos los huevos en una canasta.

Pato Molina

@patomolina

Apr 17

Replying to @claudeai

@claudeai you took down our entire organization with 60 accounts belonging to a legitimate company for no apparent reason, without any explanations. The only way to appeal the decision is by filling out a Google Form? Very bad UX and customer service.

777

1,381

9,485

5,254,602

Maximiliano Firtman

Roxana Noelia retweeted

Maximiliano Firtman

@maxifirtman

Apr 15

Hoy se cayó Claude y yo seguí mi día normal. Si se cayeran todas las IAs del universo (hasta las locales, si fuera posible), tampoco me volvería loco. Planifiquen para que esa sea la normalidad. Si tu vida depende de que Claude esté activo, algo estás haciendo mal.

665

26,810

Maximiliano Firtman

Roxana Noelia retweeted

Maximiliano Firtman

@maxifirtman

Apr 8

👁️Unos investigadores inventaron una enfermedad llamada bixonimanía donde te pican y duelen los ojos y la publicaron en la web y luego en dos papers en 2024. 👉Miles de usuarios de IAs recibieron el diagnóstico de esa enfermedad ficticia unos meses después. Era todo falso para mostrar lo fácil que es introducir información falsa en IAs. Hoy la información de esta enfermedad falsa ya fue solucionado en los modelos comerciales y las empresas gastan mucho dinero en ver cómo evitar que sean contaminadas con desinformación.

258

894

44,335

Avi Chawla

Roxana Noelia retweeted

Avi Chawla

@_avichawla

Apr 6

Docker explained in 2 minutes! Most developers use Docker daily without understanding what happens under the hood. Here's everything you need to know. Docker has 3 main components: 1) Docker Client: Where you type commands that talk to the Docker daemon via API. 2) Docker Host: The daemon runs here, handling all the heavy lifting (building images, running containers, and managing resources) 3) Docker Registry: Stores Docker images. Docker Hub is public, but companies run private registries. Here's what happens when you run "docker run": • Docker pulls the image from the registry (if not available locally) • Docker creates a new container from that image • Docker allocates a read-write filesystem to the container • Docker creates a network interface to connect the container • Docker starts the container That's it. The client, host, and registry can live on different machines. This is why Docker scales so well. Understanding this architecture makes debugging container issues much easier. You'll know exactly where to look when something breaks. ____ Find me → @_avichawla For more insights and tutorials on ML and AI Engineering!

383

16,791

Ismael Sanz

Roxana Noelia retweeted

Ismael Sanz

@sanz_ismael

Apr 5

El auge de una cultura postalfabetizada —pantallas, vídeos cortos, textos fragmentados— no solo está erosionando la concentración y la lectura profunda. Está empezando a generar una nueva forma de desigualdad cognitiva. The New York Times Como con la comida ultraprocesada, leer bien exige recursos, tiempo y entorno. La “lectura experta” reconfigura el cerebro y sostiene ciencia, democracia y pensamiento crítico. Si se convierte en un lujo, las consecuencias serán sociales y políticas nytimes.com/es/2025/07/30/es…

1,909

5,177

186,893

Night Sky Now

Roxana Noelia retweeted

Night Sky Now

@NightSkyNow

Apr 6

China’s quantum computer completed a task in 4 minutes that would literally take a supercomputer billions of years. Chinese researchers have achieved a monumental breakthrough in quantum computing with their prototype, Jiuzhang. By counting 76 photons through Gaussian boson sampling, the system completed a calculation in four minutes that would take a traditional supercomputer billions of years. This achievement shatters the previous classical record of five photons, demonstrating how an intricate array of lasers and mirrors can outperform traditional silicon bits in complex processing tasks. This milestone is more than just a speed record; it proves the viability of photon-based quantum mechanics in solving real-world challenges. From revolutionizing quantum chemistry to laying the groundwork for a secure, large-scale quantum internet, the principles of superposition and entanglement are moving from theoretical physics into functional technology. This shift promises to redefine our global computational limits, offering answers to mathematical problems once considered impossible to solve within a human lifetime. source: Zhong, H.-S., Wang, H., Deng, Y.-H., Chen, M.-C., Peng, L.-C., Luo, Y.-L., ... & Pan, J.-W. (2020). Quantum computational advantage using photons. Science.

ALT China’s quantum computer completed a task in 4 minutes that would literally take a supercomputer billions of years. Chinese researchers have achieved a monumental breakthrough in quantum computing with their prototype, Jiuzhang. By counting 76 photons through Gaussian boson sampling, the system completed a calculation in four minutes that would take a traditional supercomputer billions of years. This achievement shatters the previous classical record of five photons, demonstrating how an intricate array of lasers and mirrors can outperform traditional silicon bits in complex processing tasks. This milestone is more than just a speed record; it proves the viability of photon-based quantum mechanics in solving real-world challenges. From revolutionizing quantum chemistry to laying the groundwork for a secure, large-scale quantum internet, the principles of superposition and entanglement are moving from theoretical physics into functional technology.

126

555

1,491

64,492

Hasan Toor

Roxana Noelia retweeted

Hasan Toor

@hasantoxr

Apr 6

STANFORD UNIVERSITY compressed the entire field of LLMs and transformers into free cheatsheets anyone can use today. It covers everything from self-attention to Flash Attention, LoRA, SFT, MoE, distillation, quantization, RAG, agents, and LLM-as-a-judge. 100% Free and Open Source

171

871

46,640

Md Ismail Šojal 🕷️

Roxana Noelia retweeted

Md Ismail Šojal 🕷️

@0x0SojalSec

Apr 6

A complete architectural breakdown of transformers with intuitive visualizations in simple language. - github.com/VizuaraAI/Transfo…

498

16,874

Nav Toor

Roxana Noelia retweeted

Nav Toor

@heynavtoor

Apr 6

🚨SHOCKING: Apple just proved that AI models cannot do math. Not advanced math. Grade school math. The kind a 10-year-old solves. And the way they proved it is devastating. Apple researchers took the most popular math benchmark in AI — GSM8K, a set of grade-school math problems — and made one change. They swapped the numbers. Same problem. Same logic. Same steps. Different numbers. Every model's performance dropped. Every single one. 25 state-of-the-art models tested. But that wasn't the real experiment. The real experiment broke everything. They added one sentence to a math problem. One sentence that is completely irrelevant to the answer. It has nothing to do with the math. A human would read it and ignore it instantly. Here's the actual example from the paper: "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?" The correct answer is 190. The size of the kiwis has nothing to do with the count. A 10-year-old would ignore "five of them were a bit smaller" because it's obviously irrelevant. It doesn't change how many kiwis there are. But o1-mini, OpenAI's reasoning model, subtracted 5. It got 185. Llama did the same thing. Subtracted 5. Got 185. They didn't reason through the problem. They saw the number 5, saw a sentence that sounded like it mattered, and blindly turned it into a subtraction. The models do not understand what subtraction means. They see a pattern that looks like subtraction and apply it. That is all. Apple tested this across all models. They call the dataset "GSM-NoOp" — as in, the added clause is a no-operation. It does nothing. It changes nothing. The results are catastrophic. Phi-3-mini dropped over 65%. More than half of its "math ability" vanished from one irrelevant sentence. GPT-4o dropped from 94.9% to 63.1%. o1-mini dropped from 94.5% to 66.0%. o1-preview, OpenAI's most advanced reasoning model at the time, dropped from 92.7% to 77.4%. Even giving the models 8 examples of the exact same question beforehand, with the correct solution shown each time, barely helped. The models still fell for the irrelevant clause. This means it's not a prompting problem. It's not a context problem. It's structural. The Apple researchers also found that models convert words into math operations without understanding what those words mean. They see the word "discount" and multiply. They see a number near the word "smaller" and subtract. Regardless of whether it makes any sense. The paper's exact words: "current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data." And: "LLMs likely perform a form of probabilistic pattern-matching and searching to find closest seen data during training without proper understanding of concepts." They also tested what happens when you increase the number of steps in a problem. Performance didn't just decrease. The rate of decrease accelerated. Adding two extra clauses to a problem dropped Gemma2-9b from 84.4% to 41.8%. Phi-3.5-mini from 87.6% to 44.8%. The more thinking required, the more the models collapse. A real reasoner would slow down and work through it. These models don't slow down. They pattern-match. And when the pattern becomes complex enough, they crash. This paper was published at ICLR 2025, one of the most prestigious AI conferences in the world. You are using AI to help you make financial decisions. To check legal documents. To solve problems at work. To help your children with homework. And Apple just proved that the AI is not thinking about any of it. It is pattern matching. And the moment something unexpected shows up in your question, it breaks. It does not tell you it broke. It just quietly gives you the wrong answer with full confidence.

855

2,905

11,448

2,132,694