Zhihu Frontier

Zhihu Frontier

@ZhihuFrontier

Apr 1

🚨 Anthropic's Claude Code Source Leak: What It Actually Exposes
A careless build mistake just laid bare one of the most advanced AI coding tools, and the lessons are huge. Insights from Zhihu contributor deephub 👇
🏢 About Anthropic
Anthropic is a leading AI safety-focused company, widely known for building the Claude model series with a strong emphasis on security and reliability.
📉 The Leak in Short
A 60MB source map file was mistakenly bundled in the npm release, revealing full source code, system prompts, and internal logic, though no model weights were leaked.
🧠 Expert Analysis from deephub:
• The incident stems from a basic build configuration error, likely from manual packaging; a similar incident passed quietly in early 2025.
• The leak effectively makes Claude Code "open-sourced", exposing agent orchestration, tool execution, and context management strategies.
• Yet its real strength lies in the model's native code reasoning ability, so competitive and technical damage remains limited.
• For a multi-billion-dollar, safety-focused AI company, this low-level engineering mistake is highly embarrassing.
• The leak also uncovered unreleased features: task budget management, AFK mode, Penguin (fast mode), and redirected reasoning.
🎯 Final Takeaway
A costly misstep for Anthropic, but an unprecedented learning opportunity for the entire AI agent community.
🔗 Full article (CN): zhihu.com/question/202239436…
#AI #LLM #Agents #Claude #Engineering #Tech

Zhihu Frontier

@ZhihuFrontier

Mar 25

🚨 Big shift in AI video: announcements of OpenAI shutting down Sora are sparking debate. Why do flashy AI video products struggle to last?
💡 Zhihu contributor deephub: the bottleneck is business, not tech.
• Compute is too expensive
• Regulation and copyright risks
• Low retention
👉 Consumer AI video still lacks a sustainable model. That's why Anthropic sticks to B2B, and OpenAI may follow.
🎰 Zhihu contributor 詹于 compares Sora to Seedance: AI video is just like a "gacha game". You generate N outputs, then pick one. What matters is hit rate, not just capability. Even a 1% hit rate = huge usability gain.
• Sora → low accuracy → feels like a toy
• Seedance (ByteDance) → higher consistency, moving it from a "toy" to an "imperfect but usable tool"
Plus: China has massive "semi-pro" content demand: short videos, livestreams, ads, web novels…
👉 These markets don't need perfection, just better-than-before tools. That's why products like Seedance can scale quickly.
🧠 Zhihu contributor 12345: OpenAI lost focus
Too many fronts (Sora, Atlas, AgentKit, etc.), high cost, unclear ROI. Given extreme video generation costs and ongoing copyright issues (across the industry), shutting Sora down may come late, but it is necessary.
👉 Future direction? Possibly back to devs, coding, enterprise, core infra.
🔗 Join the discussion: zhihu.com/question/202006008…
#AI #Sora #OpenAI #Video #GenerativeAI #Tech
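The "gacha" framing above can be made concrete with a little arithmetic. The sketch below assumes each generation is an independent draw with the same per-output hit rate (a simplification that the thread does not state), and the function name is hypothetical. It computes the chance that at least one of N outputs is usable.

```python
def p_at_least_one_hit(hit_rate: float, n_outputs: int) -> float:
    """Chance that at least one of n_outputs generations is usable.

    Assumption (not from the thread): every generation is an independent
    draw with the same per-output hit rate.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if n_outputs < 0:
        raise ValueError("n_outputs must be non-negative")
    # P(at least one hit) = 1 - P(every draw misses)
    return 1.0 - (1.0 - hit_rate) ** n_outputs


if __name__ == "__main__":
    # Compare a low and a higher per-output hit rate across batch sizes.
    for rate in (0.01, 0.10):
        for n in (1, 10, 100):
            print(f"hit_rate={rate:.2f} n={n:3d} -> {p_at_least_one_hit(rate, n):.3f}")
```

Under this assumption, a small change in per-output hit rate compounds across a batch, which is one way to read the point that hit rate matters more than peak capability.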

Zhihu Frontier

@ZhihuFrontier

Feb 27

🤔 Can agentic LLM inference break free from storage bandwidth limits? A new paper by DeepSeek together with THU & PKU says yes, by rethinking the Prefill / Decode split at the system level, and it is drawing major attention. 🚀
What's the real innovation? 👉 Zhihu contributor deephub explains: relentlessly extracting every last bit of GPU bandwidth.
🔍 Core insight: a hidden resource mismatch
Prefill nodes saturate their storage NIC bandwidth, while decode nodes leave their storage NICs almost completely idle. DualPath treats both sets of NICs as a global bandwidth pool instead of isolated resources.
• Traditional path: Storage → Prefill engine
• New parallel path: Storage → Decode engine (as a buffer) → RDMA (high-speed compute NIC) → Prefill engine
💡 In short: idle Decode-side bandwidth now participates in KV-cache movement, not just computation.
⚙️ Why this is hard to build
1️⃣ Dataflow reorganization is extremely complex
• KV-cache may traverse two physical paths
• Must stream layer-by-layer, overlapped with compute
• Requires seamless transitions across storage, DRAM, and HBM
• Any timing mismatch → GPU stalls or buffer overflow
2️⃣ Traffic isolation goes deep into hardware
• Using DSCP marking + TC on RoCE for traffic classification is not application-layer work
• Done wrong, KV traffic will starve inference communication and worsen latency
3️⃣ Scheduler design is critical
• Must observe disk queues, compute load, and both paths in real time, and dynamically allocate bandwidth
• Internally, engines split token blocks via binary search under compute quotas
• Scheduler itself must be fast, or it becomes the bottleneck
📊 Performance results
• Offline inference: up to 1.87× speedup on DS-660B
• Smaller models (DS-27B): even with DualPath, TPOT remains higher than baseline → bandwidth gains can't amortize costs
🤖 Why agentic scenarios matter most
Large agentic models repeatedly access long-context KV caches, where I/O bandwidth becomes the true bottleneck. DualPath is explicitly designed for this load pattern, enabling higher concurrency, faster responses, and lower inference cost in multi-agent systems.
💬 Join the discussion on Zhihu: zhihu.com/question/201067068…
#LLM #Agentic #AI #Inference #KVCache #DeepSeek #Infra
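Two ideas from the thread above, the pooled bandwidth and the binary search under a compute quota, can be sketched in a few lines. This is a toy illustration rather than the paper's scheduler: `Path`, `split_kv_load`, and `max_tokens_under_quota` are hypothetical names, the proportional split is an assumed policy, and the cost function is made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Path:
    name: str
    free_bandwidth: float  # spare storage-NIC bandwidth, arbitrary units (hypothetical)


def split_kv_load(total_bytes: int, paths: list[Path]) -> dict[str, int]:
    """Split a KV-cache read across paths in proportion to spare bandwidth.

    Assumed policy: a proportional split lets all paths finish at about the
    same time. The real scheduler also watches disk queues and compute load.
    """
    total_bw = sum(p.free_bandwidth for p in paths)
    if total_bw <= 0:
        raise ValueError("no spare bandwidth on any path")
    shares: dict[str, int] = {}
    assigned = 0
    for p in paths[:-1]:
        share = int(total_bytes * p.free_bandwidth / total_bw)
        shares[p.name] = share
        assigned += share
    # The last path takes the remainder so the shares sum exactly.
    shares[paths[-1].name] = total_bytes - assigned
    return shares


def max_tokens_under_quota(n_tokens: int, quota: int, cost: Callable[[int], int]) -> int:
    """Largest k in [0, n_tokens] with cost(k) <= quota, found by binary search.

    Requires cost to be non-decreasing in k; returns 0 if nothing fits.
    """
    lo, hi = 0, n_tokens
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if cost(mid) <= quota:
            lo = mid
        else:
            hi = mid - 1
    return lo


if __name__ == "__main__":
    paths = [Path("storage->prefill", 1.0), Path("storage->decode->rdma->prefill", 3.0)]
    print(split_kv_load(1000, paths))
    # Made-up cost model: a quadratic (attention-like) term plus a linear term.
    print(max_tokens_under_quota(1000, 500_000, lambda k: k * k + 1000 * k))
```

The binary search keeps the per-decision cost logarithmic in the block size, which matches the thread's point that the scheduler itself has to be cheap.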

离谱

@LipuAIX

Jan 25

x.com/i/article/201538935872…

OG Alpha @Bet_Winstreak

Jan 20

Deeptics @Deeptics_ is now live on Sol as DEEPHUB $DHUB, Robotics 🤖 on-chain ecosystem that unifies robotic simulation. $7k mc is a gift. Don't miss it CA: 5mZkHgUtEsFZ4qgdYy3qFMrrx8ndxYAi8zYYg7oFpump TG: t.me/deeptics

Zhihu Frontier

@ZhihuFrontier

Jan 14

@deepseek_ai just dropped a joint paper with Peking University, and it's turning heads. 🧠🔥
Zhihu contributor deephub breaks it down clearly:
📌 The core idea: separate "thinking" from "remembering."
DeepSeek is pushing sparsity to the extreme:
• MoE = sparse computation → only a subset of experts activate each step ⚙️
• Engram = sparse storage → only relevant memory fragments are retrieved 📚
Put together, this likely previews what DeepSeek V4 could look like: models with exploding parameter counts, but inference costs that stay surprisingly low. A future LLM might be a small, sharp reasoning core with a huge, external, constantly updatable memory system.
❓ What problem are they really solving?
Transformers have no native memory. Today's LLMs, even MoE models, use expensive neural computation to simulate memory. They repeatedly rebuild what is basically a lookup table using attention + FFN layers, which is inefficient.
👉 DeepSeek's key move: add real memory
They split language tasks into two types:
• Compositional reasoning → needs deep neural computation
• Knowledge retrieval → should be cheap and direct
So instead of forcing neural nets to "calculate" static facts, they add a new module: Engram = conditional memory.
How it works:
• Input text is broken into n-grams
• These n-grams are mapped into a huge hash table
• Lookup is O(1) time
• Retrieved vectors are fused with normal Transformer outputs
Names, formulas, fixed phrases, factual patterns: no more wasting layers to reconstruct them. Cheap memory handles static patterns; expensive compute is saved for real reasoning.
Engram turns the classic process into slice → hash → lookup → fuse, in constant time no matter how big the memory gets.
🤔 MoE + Engram: how to split capacity?
They study a key tradeoff: given fixed parameter and compute budgets, how much goes to MoE experts vs Engram memory?
They define an allocation ratio ρ (the share given to MoE experts):
• ρ = 100% → pure MoE
• ρ ≈ 40% → already matches pure MoE performance
• Best point: ρ ≈ 75–80% → validation loss drops below pure MoE by ~0.014
This gives a U-shaped curve:
• All MoE → no real memory, wasted compute
• All Engram → weak reasoning
• Balanced → best of both worlds
So MoE and Engram are structurally complementary.
‼️ Infrastructure implications are huge
Engram indices depend only on input tokens, so they're known before the forward pass. That enables:
• Prefetching memory asynchronously
• Overlapping lookup with early-layer computation
• Hiding communication latency
Even more important: Engram is sparse + read-only at inference, so the memory table doesn't need HBM and can live in normal RAM. You're basically attaching a giant internal knowledge base to the model.
In theory, Engram can be updated directly: new knowledge → update the table. Faster than LoRA.
Full analysis: zhihu.com/question/199423340…
#DeepSeek #AI #MoE #Engram #LLM #Research
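The slice → hash → lookup → fuse pipeline described above can be sketched as a toy in plain Python. Everything here is an assumption made for illustration: the table size, the hash function, the additive gate, and names such as `EngramTable` and `fuse` are hypothetical and are not taken from the paper.

```python
import hashlib

TABLE_SIZE = 1 << 16  # number of slots (hypothetical)
DIM = 4               # width of each memory vector (hypothetical)


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Slice: every contiguous run of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def slot(ngram: tuple[str, ...]) -> int:
    """Hash: map an n-gram to a slot. The index depends only on input tokens."""
    key = "\x1f".join(ngram).encode("utf-8")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % TABLE_SIZE


class EngramTable:
    """Toy memory. A dict stands in for a large dense table held in host RAM."""

    def __init__(self) -> None:
        self.rows: dict[int, list[float]] = {}

    def write(self, ngram: tuple[str, ...], vector: list[float]) -> None:
        # Updating knowledge is a table write; no gradient step is involved.
        self.rows[slot(ngram)] = list(vector)

    def read(self, ngram: tuple[str, ...]) -> list[float]:
        # Lookup: one hash plus one dict access, independent of table size.
        return self.rows.get(slot(ngram), [0.0] * DIM)


def fuse(hidden: list[float], memory: list[float], gate: float = 0.5) -> list[float]:
    """Fuse: add a scaled memory vector to a hidden state (assumed form)."""
    return [h + gate * m for h, m in zip(hidden, memory)]


if __name__ == "__main__":
    table = EngramTable()
    table.write(("Peking", "University"), [1.0, 0.0, 0.0, 0.0])
    tokens = ["Peking", "University", "paper"]
    hidden = [1.0, 1.0, 1.0, 1.0]
    for gram in ngrams(tokens, 2):
        print(gram, fuse(hidden, table.read(gram)))
```

Because `slot` needs only the tokens, all lookups for a sequence could be issued before the forward pass starts, which is the property the prefetching argument above relies on.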
