Jack

Tweets

Pinned Tweet

Jack @JackNotOld

May 26

Put out my first LessWrong blog post! Interpretability treats steering directions like "control knobs". I checked whether that assumption is mathematically valid across 8 different models. At α = 1, it breaks in 92% of cases. lesswrong.com/posts/nnwLHsBb…

Jack retweeted

david rein

@idavidrein

Jun 8

Replying to @willdepue

totally disagree. it's one of the best reward signals we have because it's very difficult to hack and is highly correlated with having an accurate model of the world.

Jack @JackNotOld

Jun 6

Been wearing a Garmin for a year, looking to transition to something more compact for everyday use/sleep, outside of workouts. I’ve basically ruled out Whoop and Oura, so is Google Fitbit any good? Worth the time?

Jack @JackNotOld

Jun 5

The @ElevenLabs robot is serving coffee in SoHo.

Jack retweeted

elie

@eliebakouch

Jun 2

WOW microsoft new "MAI Thinking 1" model comes with a 109 page tech report that looks REALLY detailed, this is amazing

Jack retweeted

Ali Hatamizadeh

@ahatamiz1

May 21

Gated DeltaNet-2 is here. 🚀

🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆

💡 Here's the idea behind it: Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn't just what to forget, it's how to edit that memory without scrambling the associations already in it. Prior delta-rule models like Gated DeltaNet and KDA use one scalar gate to do two jobs at once: erasing old content and writing new content. But these two decisions act on different axes of the state, so tying them together is a real limitation. Gated DeltaNet-2 decouples them.

✂️ a channel-wise erase gate b_t picks which key-side coordinates to read and remove
✍️ a channel-wise write gate w_t picks which value-side coordinates to commit
🔁 recovers KDA when both gates collapse to a scalar, and Gated DeltaNet when the decay collapses too
⚡ still trains fast: chunkwise WY algorithm with gate-aware backward, fused in Triton

📊 Results: We train 1.3B models on 100B tokens of FineWeb-Edu, matched in recurrent state size, against Mamba-2, Gated DeltaNet, KDA, and Mamba-3.

Best average on language modeling and commonsense reasoning, in both recurrent and hybrid settings
Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38

Joint work with @YejinChoinka and @jankautz.

📄 Paper: shorturl.at/AAlVb
💻 Code: github.com/NVlabs/GatedDelta…

#LinearAttention #StateSpaceModels #Mamba #LLM

Jack @JackNotOld

May 21

New paper! Trained an SAE on Qwen's recurrent state writes. Found an "erase" feature. Substituting it for the model's "write" drops the target token from next-token logits. The shift factors through forget, read, output at R²=0.98 with no fitted params. arxiv.org/abs/2605.12770

Jack @JackNotOld

May 21

HF: huggingface.co/JackYoung27/w…
Code: github.com/JackYoung27/write…

Jack retweeted

Jediah Katz

@jediahkatz

May 19

i would never hire anyone with a 4 year resume gap

Polymarket Money

@PolymarketMoney

May 19

Andrej Karpathy's incredible resume:
> Google, Working on DeepMind (2015)
> OpenAI, Founding member (2016 - 2017)
> Tesla, Senior Director of AI (2017 - 2022)
> Anthropic, Working on R&D (2026)

Jack retweeted

Citrini

@citrini

May 15

Morgan Stanley’s price discovery happens on @tradexyz

@sershokunin

May 15

👀 @tradexyz @HyperliquidX

Jack @JackNotOld

May 11

“Directionally very interactive”

Thinking Machines

@thinkymachines

May 11

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/int…

Jack @JackNotOld

Apr 8

Trained states and dataset now on @huggingface Hub 🤗

Hybrid models (Qwen3.5, FalconH1) initialize 75% of their parameters to zero. We trained those initial states on 45 verified solutions: 23.6pp on HumanEval, 10.8pp over LoRA, zero inference overhead.

Try S₀ tuning on Qwen3.5-4B without training: huggingface.co/JackYoung27/s…
Training data (45 verified HumanEval solutions): huggingface.co/datasets/Jack…
Github: github.com/JackYoung27/s0-tu…
Paper: huggingface.co/papers/2604.0…

Jack retweeted

martin_casado

@martin_casado

Apr 8

Mythos appears to be the first class of models trained at scale on Blackwells. Then will be Vera Rubins. Pre-training isn't saturated. RL works. And there is *so much* computing coming online soon. Buckle your chin straps. It's going to be fucking wild.

Jack retweeted

Jack @JackNotOld

Apr 2

Code: github.com/JackYoung27/s0-tu…
Paper: arxiv.org/abs/2604.01168
pip install s0-tuning

This suggests a different axis of adaptation: state, not weights.

Jack retweeted

Jack @JackNotOld

Apr 2

We’ve been tuning the wrong part of LLMs. Instead of adapting weights (LoRA) or adding adapters, we tune the initial recurrent state (S0) in hybrid recurrent LLMs. This beats LoRA by 10.8pp on HumanEval (72.2% vs 61.4% on Qwen3.5-4B), with zero inference overhead.

Jack @JackNotOld

Apr 2

~20 gradient steps
no hyperparameter search
runs on a single consumer GPU
merges with a single tensor copy

No changes at inference. No added latency.

85% of corrected solutions diverge at the first token, suggesting early state control drives the gains.
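
A toy sketch of the "state, not weights" idea from this thread, for intuition only. It is not the s0-tuning package or the paper's method: the diagonal linear recurrence, the function names (run, tune_s0), and every number below are illustrative assumptions. The weights stay frozen and only the zero-initialized starting state s0 is updated, for about 20 gradient steps.

```python
# Toy illustration only: a frozen diagonal linear recurrence whose sole
# trainable quantity is the initial state s0. All names and values here are
# hypothetical and unrelated to the real s0-tuning package.

def run(s0, xs, decay, w_in, w_out):
    """Roll s_t = decay * s_{t-1} + w_in * x_t, then read out a scalar."""
    s = list(s0)
    for x in xs:
        s = [d * si + wi * x for d, si, wi in zip(decay, s, w_in)]
    return sum(wo * si for wo, si in zip(w_out, s))


def tune_s0(xs, target, decay, w_in, w_out, steps=20, lr=0.5):
    """Gradient descent on s0 only; decay, w_in and w_out are never touched."""
    s0 = [0.0] * len(decay)  # start from an all-zero initial state
    n = len(xs)
    # For this linear model, d(output)/d(s0_i) = w_out_i * decay_i ** n.
    grad_out = [wo * d ** n for wo, d in zip(w_out, decay)]
    for _ in range(steps):
        err = run(s0, xs, decay, w_in, w_out) - target
        # Squared-error loss, so the gradient w.r.t. s0_i is 2 * err * grad_out_i.
        s0 = [si - lr * 2.0 * err * g for si, g in zip(s0, grad_out)]
    return s0


# Frozen "weights" and a short input sequence (arbitrary toy values).
DECAY, W_IN, W_OUT = [0.9, 0.8], [0.3, -0.2], [1.0, 1.0]
XS, TARGET = [1.0, -0.5, 2.0], 1.0

tuned_s0 = tune_s0(XS, TARGET, DECAY, W_IN, W_OUT)
before = run([0.0, 0.0], XS, DECAY, W_IN, W_OUT)
after = run(tuned_s0, XS, DECAY, W_IN, W_OUT)
```

In this toy, the tuned state replaces the zero state in place, so the recurrence itself runs unchanged afterwards, which mirrors the "merges with a single tensor copy" and "no added latency" points only in spirit.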
