takeUforward

M1D0R1

Feb 24

🚨 @AnthropicAI just exposed massive AI model theft: DeepSeek, Moonshot AI, and MiniMax ran industrial-scale "distillation attacks" on Claude to steal its capabilities. Dropped Feb 23, 2026. Here's the breakdown for devs who care about this stuff.

They created ~24,000 fake accounts, bypassing regional blocks via proxy "hydra clusters" (networks of fraudulent accounts that blend spam with legit traffic), and generated over 16 MILLION exchanges with Claude. Goal: extract high-quality outputs (especially chain-of-thought reasoning, agentic/tool use, and coding) to train and boost their own models cheaply and fast.

Distillation explained quick:
Legit way = take your own big model → distill it into a smaller/faster version for users.
Illicit way = query a stronger model like Claude at huge scale → collect the responses → train your weaker model on that data. Skips billions in compute and years of R&D.

But it strips safety alignment. US labs bake in blocks against bioweapons and malicious cyber use; copies often lose those, which is a big national-security risk if the copies feed authoritarian surveillance or disinformation, or get open-sourced.

Per-lab breakdown (Anthropic attributed each campaign with high confidence via IPs, metadata, infrastructure, and industry corroboration):

DeepSeek (~150k exchanges): Targeted reasoning, rubric grading (for RL rewards), and censorship-safe rewrites of sensitive queries (dodging dissident/party topics). Used synchronized traffic, load balancing, and chain-of-thought (CoT) prompts to generate training data. Traced to specific researchers.

Moonshot AI (3.4M exchanges): Agentic reasoning, tool use, coding/data analysis, computer-use agents, even vision. Hundreds of varied fake accounts. Later tried reconstructing Claude's reasoning traces. Metadata linked to senior staff.

MiniMax (13M exchanges): Agentic coding and tool orchestration. Caught mid-campaign, before their model launch. When Anthropic released a new version, they pivoted in <24hrs to steal from it. The activity matched their public roadmap.
This undermines US export controls on chips/AI tech: it looks like "rapid Chinese innovation" but relies on stolen US capabilities and on restricted hardware for scale.

Anthropic's fixes: better detection (classifiers for attack patterns and bulk CoT prompting, plus spotting of coordinated accounts), intel sharing with labs, cloud providers, and authorities, tighter account verification (edu and startup programs were commonly exploited), and model-level safeguards that make outputs less distillable without hurting real users. They say the attacks are ramping up fast and that this needs industry and policy collaboration NOW. Window closing.

For coders: this shows frontier model IP is way more fragile than people think. If distillation becomes standard, the race is about who hides theft best. Fair global competition or a red line?

Full post from Anthropic (worth the read): anthropic.com/news/detecting…

#AI #Claude #DistillationAttacks #AISecurity #DeepSeek #MoonshotAI #MiniMax
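
For devs who want the "legit way" made concrete: below is a minimal sketch of the classic teacher → student distillation loss, where the student is trained to match the teacher's temperature-softened output distribution. Everything here is an illustrative assumption (the function names, the toy logits, the temperature value); none of it comes from Anthropic's post, which does not describe any lab's training setup. Note that the illicit variant described above works from collected text responses, not from the teacher's logits, so this only illustrates the basic idea of a student imitating a teacher.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper for illustration: the loss is 0 when the student
    reproduces the teacher's distribution and grows as they diverge.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy logits over three classes (made-up numbers).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different class

print("close student loss:", distillation_loss(teacher, close_student))
print("far student loss:  ", distillation_loss(teacher, far_student))
```

The student that tracks the teacher gets the smaller loss, which is the signal a training loop would minimize.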

Detecting and preventing distillation attacks

anthropic.com