Noam Hurwitz

Noam Hurwitz

246 Photos and videos

Tweets

Pinned Tweet

Noam Hurwitz

@ProbablyNoam

May 14

x.com/i/article/205473226070…

1,037

Nebius Token Factory

Noam Hurwitz retweeted

Nebius Token Factory

@nebiustf

Jun 13

Not your weights, not your model.

411

57,302

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 11

Producing just a single Claude Fable 5 or GPT 5.5 token moves hundreds of GB of data. Indeed, as large models continue to scale active parameters and context sizes, an increasing share of time goes to moving data to the processors rather than actually processing it. The result is longer serving latency and expensive, idle compute. This is called the memory wall, and different approaches and tradeoffs to it are explored below. For the full interactive essay, check out the link in comments. ----- Modern GPUs are increasingly fast at performing the arithmetic behind frontier AI. The issues arise because the data for the computation has to be physically present on the compute die before the arithmetic can start. For large models, this can be a lot of data. Specifically: - The model weights need to be available. A good heuristic is that 1B parameters equates to approximately 1GB of weights (at 8-bit precision). So Mythos-class models with 10T parameters require 10 TB of memory to store them. -- Mixture-of-expert architectures enable inference on a subset of these weights, so for a given computation perhaps “only” a few hundred GB is actually needed. - The user context also needs to be available. For frontier models, this context can be up to 1M tokens, each of which attends to every other token. The resulting key-value (KV) cache can easily be tens of GB per user. The problem arises as the ratio of data needed to on-die memory available continues to climb. As an example, consider the new NVIDIA Blackwell Ultra which has 160 register files of 256 KB each that can be processed in parallel. That means that for 100GB of weights there is roughly 40MB of available working memory, and that to produce just one token those weights must stream through in 100 GB / 40 MB = 2,500 fills of the combined register files — and they must do so again for every token, before even factoring in user context. Modern architectures tier the memory so that the most commonly needed values are stored closest to the compute dies, so often these fills are blindingly fast. But in a sequence that produces millions of tokens, in parallel for hundreds of millions of users, time and idle compute quickly add up. The link beloe illustrates two different breakdowns of the problem. The first breakdown looks at how real chip architectures make tradeoffs to tackle this problem, using real model scenarios (The Cost to Produce a Token). These chips are then compared on axes of token throughputs and configurability, with different model selections again highlighting the tradeoffs made (The Architecture Tradeoff Matrix). The second breakdown explores how fast each successive tier of memory is in this shuttling sequence and the impact of storing values at that layer on token throughput for the same real model scenarios (Memory Tiering Impact on Token Throughput).

0:11

274

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 11

Interactive essay: nhurwitz.com/writing/the-mem…

127

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 8

TIL about powerline networking. You can actually network over existing household copper power wires. Normal household electricity in the U.S. is alternating current at 60 Hz. Powerline networking superimposes a separate, much higher-frequency signal onto the same wires. Adapters on both ends generate and then filter for this signal. Router → Ethernet cable → powerline adapter → house electrical wiring → powerline adapter → Ethernet cable → online device.

0:08

307

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 8

Helpful breakdown of frontier architectures for robotics models!

Vai Viswanathan

@vai_viswanathan

Jun 8

As promised, breaking down VLA-JEPA. VLA-JEPA’s contribution is a unique way to learn control-relevant dynamics from that unlabeled video. tldr; a standard VLA architecture with a World Model that “teaches” the policy physics. 3 training phases, 4 model components. This one took a few reads some LLM Q&A before I could grok it. Follow along for an easy-to-understand breakdown!

2,760

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 7

A charmed city ~

2,356

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 5

Hearing that we’re already seeing a microcosm of this at Google. Internal teams are going to be facing a capacity squeeze and resource allocation contention. For now, DRAM prices are the main driver.

roon

@tszzl

Jun 4

there is no such thing as running out of compute. for the right price someone will sell you compute. it’s an elastic resource like all other markets. when RSI arrives running that program will be so valuable that all clouds will mostly shut down and sell compute to the singularity

956

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 5

Doing my part to contribute to shortage 🫡

285

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 5

x.com/kairospraxis/status/20…

Kairos @KairosPraxis

Jun 5

More on memory prices causing shifts in the real economy - this time from Andy Jassy at $AMZN. In short, more companies are moving from on-prem to the cloud (to AWS) because of the memory supply crunch. Hyperscalers are the only ones with storage capacity right now. We live in wild, wild times.

355

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 5

Great post from @EclipseVentures. So many conversations end up pointing fingers at design for manufacturing. If you're late to market with your perfect design because its not manufacturable, its not a perfect design. "Design is manufacturability. They are the same job." eclipse.capital/blog/how-des…

Why the 'Design for Manufacturing' Mindset Kills Companies

eclipse.capital

204

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 3

I’ve recently interviewed ~30 first-time hardware founders/operators. Nobody shipped on their original timeline, and nearly all were off by more than a year. Crazy that validation in market takes this long to test.

1,823

Dima Smirnov

Noam Hurwitz retweeted

Dima Smirnov

@dima_smirnov_

Jun 3

I’ve always believed text prompts for image generation to be a bug, not a feature. Reve 2.0 fixes that. By making Large Layout Models a reality, we deliver the intuitive, hands-on control needed to make image generation truly useful, usable, and fun. So proud of our team!

Reve

@reve

Jun 3

Today, we’re launching Reve 2.0, the best 4K image model in the world. We invented a new way to generate and edit any image using precise layouts. For the first time, it’s possible to create images you can touch.

0:54

2,981

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 2

x.com/i/article/206158638522…

642

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 2

Interactive version here: nhurwitz.com/writing/net-pay…

164

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

Jun 1

Brought to you by the codex text-to-cad skill, thank you @earthtojake.

0:09

612

Palmer Luckey

Noam Hurwitz retweeted

Palmer Luckey

@PalmerLuckey

May 25

Replying to @AI_EmeraldApple

This is why we need American-made computers and phones.

253

321

9,352

130,620

Vai Viswanathan

Noam Hurwitz retweeted

Vai Viswanathan

@vai_viswanathan

May 24

touch grass

Sahil

@sahill_og

May 23

Name one thing that you can do better than CLAUDE

630

🅿️

Noam Hurwitz retweeted

🅿️

@the_P_God

May 21

Logged on RuneScape for the first time in a while. Map looks a little different.

107

1,292

23,056

870,642

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

May 17

Welcome to Alfred’s Cafe, we have jolly good service.

0:34

Noam Hurwitz

@ProbablyNoam

May 16

2,015

Noam Hurwitz

Noam Hurwitz

@ProbablyNoam

May 18

Alfred just sold a $3 strawberry in 45 seconds. That equates to a >$2m run rate 🤯. Ping @earthtojake if you're a VC.

165