Put out my first LessWrong blog post!
Interpretability treats steering directions like "control knobs". I checked whether that assumption is mathematically valid across 8 different models.
At α = 1, it breaks in 92% of cases.
lesswrong.com/posts/nnwLHsBb…
totally disagree.
it's one of the best reward signals we have because it's very difficult to hack and is highly correlated with having an accurate model of the world.
Been wearing a Garmin for a year, looking to transition to something more compact for everyday use/sleep, outside of workouts.
I’ve basically ruled out whoop aura, so is Google Fitbit any good? Worth the time?
Gated DeltaNet-2 is here. 🚀
🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆
💡 Here's the idea behind it:
Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn't just what to forget, it's how to edit that memory without scrambling the associations already in it.
Prior delta-rule models like Gated DeltaNet and KDA use one scalar gate to do two jobs at once: erasing old content and writing new content. But these two decisions act on different axes of the state, so tying them together is a real limitation.
Gated DeltaNet-2 decouples them.
✂️ a channel-wise erase gate b_t picks which key-side coordinates to read and remove
✍️ a channel-wise write gate w_t picks which value-side coordinates to commit
🔁 recovers KDA when both gates collapse to a scalar, and Gated DeltaNet when the decay collapses too
⚡ still trains fast: chunkwise WY algorithm with gate-aware backward, fused in Triton
📊 Results:
We train 1.3B models on 100B tokens of FineWeb-Edu, matched in recurrent state size, against Mamba-2, Gated DeltaNet, KDA, and Mamba-3.
Best average on language modeling commonsense reasoning, in both recurrent and hybrid settings
Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38
Joint work with @YejinChoinka and @jankautz.
📄 Paper: shorturl.at/AAlVb
💻 Code: github.com/NVlabs/GatedDelta…#LinearAttention#StateSpaceModels#Mamba#LLM
New paper! Trained an SAE on Qwen's recurrent state writes.
Found an "erase" feature. Substituting it for the model's "write" drops the target token from next-token logits. The shift factors through forget, read, output at R²=0.98 with no fitted params.
arxiv.org/abs/2605.12770
Andrej Karpathy's incredible resume:
> Google, Working on DeepMind (2015)
> OpenAI, Founding member (2016 - 2017)
> Tesla, Senior Director of AI (2017 - 2022)
> Anthropic, Working on R&D (2026)
People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way.
We share our approach, early results, and a quick look at our model in action.
thinkingmachines.ai/blog/int…
Mythos appears to be the first class of models trained at scale on Blackwells. Then will be Vera Rubins. Pre-training isn't saturated. RL works. And there is *so much* computing coming online soon.
Buckle your chin strips. It's going to be fucking wild.
We’ve been tuning the wrong part of LLMs.
Instead of adapting weights (LoRA) or adding adapters, we tune the initial recurrent state (S0) in hybrid recurrent LLMs.
This beats LoRA by 10.8pp on HumanEval (72.2% vs 61.4% on Qwen3.5-4B), with zero inference overhead.
We’ve been tuning the wrong part of LLMs.
Instead of adapting weights (LoRA) or adding adapters, we tune the initial recurrent state (S0) in hybrid recurrent LLMs.
This beats LoRA by 10.8pp on HumanEval (72.2% vs 61.4% on Qwen3.5-4B), with zero inference overhead.
~20 gradient steps
no hyperparameter search
runs on a single consumer GPU
merges with a single tensor copy
No changes at inference. No added latency.
85% of corrected solutions diverge at the first token, suggesting early state control drives the gains.