Recovering software guy.

Joined March 2020
246 Photos and videos
Noam Hurwitz retweeted
Not your weights, not your model.
5
33
411
57,302
Producing just a single Claude Fable 5 or GPT 5.5 token moves hundreds of GB of data. Indeed, as large models continue to scale active parameters and context sizes, an increasing share of time goes to moving data to the processors rather than actually processing it. The result is longer serving latency and expensive, idle compute. This is called the memory wall, and different approaches and tradeoffs to it are explored below. For the full interactive essay, check out the link in comments. ----- Modern GPUs are increasingly fast at performing the arithmetic behind frontier AI. The issues arise because the data for the computation has to be physically present on the compute die before the arithmetic can start. For large models, this can be a lot of data. Specifically: - The model weights need to be available. A good heuristic is that 1B parameters equates to approximately 1GB of weights (at 8-bit precision). So Mythos-class models with 10T parameters require 10 TB of memory to store them. -- Mixture-of-expert architectures enable inference on a subset of these weights, so for a given computation perhaps “only” a few hundred GB is actually needed. - The user context also needs to be available. For frontier models, this context can be up to 1M tokens, each of which attends to every other token. The resulting key-value (KV) cache can easily be tens of GB per user. The problem arises as the ratio of data needed to on-die memory available continues to climb. As an example, consider the new NVIDIA Blackwell Ultra which has 160 register files of 256 KB each that can be processed in parallel. That means that for 100GB of weights there is roughly 40MB of available working memory, and that to produce just one token those weights must stream through in 100 GB / 40 MB = 2,500 fills of the combined register files — and they must do so again for every token, before even factoring in user context. Modern architectures tier the memory so that the most commonly needed values are stored closest to the compute dies, so often these fills are blindingly fast. But in a sequence that produces millions of tokens, in parallel for hundreds of millions of users, time and idle compute quickly add up. The link beloe illustrates two different breakdowns of the problem. The first breakdown looks at how real chip architectures make tradeoffs to tackle this problem, using real model scenarios (The Cost to Produce a Token). These chips are then compared on axes of token throughputs and configurability, with different model selections again highlighting the tradeoffs made (The Architecture Tradeoff Matrix). The second breakdown explores how fast each successive tier of memory is in this shuttling sequence and the impact of storing values at that layer on token throughput for the same real model scenarios (Memory Tiering Impact on Token Throughput).
1
5
274
TIL about powerline networking. You can actually network over existing household copper power wires. Normal household electricity in the U.S. is alternating current at 60 Hz. Powerline networking superimposes a separate, much higher-frequency signal onto the same wires. Adapters on both ends generate and then filter for this signal. Router → Ethernet cable → powerline adapter → house electrical wiring → powerline adapter → Ethernet cable → online device.
1
4
307
Helpful breakdown of frontier architectures for robotics models!
As promised, breaking down VLA-JEPA. VLA-JEPA’s contribution is a unique way to learn control-relevant dynamics from that unlabeled video. tldr; a standard VLA architecture with a World Model that “teaches” the policy physics. 3 training phases, 4 model components. This one took a few reads some LLM Q&A before I could grok it. Follow along for an easy-to-understand breakdown!
1
6
2,760
A charmed city ~
1
2
26
2,356
Hearing that we’re already seeing a microcosm of this at Google. Internal teams are going to be facing a capacity squeeze and resource allocation contention. For now, DRAM prices are the main driver.
Jun 4
there is no such thing as running out of compute. for the right price someone will sell you compute. it’s an elastic resource like all other markets. when RSI arrives running that program will be so valuable that all clouds will mostly shut down and sell compute to the singularity
1
10
956
Doing my part to contribute to shortage 🫡
2
3
285

More on memory prices causing shifts in the real economy - this time from Andy Jassy at $AMZN. In short, more companies are moving from on-prem to the cloud (to AWS) because of the memory supply crunch. Hyperscalers are the only ones with storage capacity right now. We live in wild, wild times.
1
355
Great post from @EclipseVentures. So many conversations end up pointing fingers at design for manufacturing. If you're late to market with your perfect design because its not manufacturable, its not a perfect design. "Design is manufacturability. They are the same job." eclipse.capital/blog/how-des…
3
204
I’ve recently interviewed ~30 first-time hardware founders/operators. Nobody shipped on their original timeline, and nearly all were off by more than a year. Crazy that validation in market takes this long to test.
3
9
1,823
Noam Hurwitz retweeted
I’ve always believed text prompts for image generation to be a bug, not a feature. Reve 2.0 fixes that. By making Large Layout Models a reality, we deliver the intuitive, hands-on control needed to make image generation truly useful, usable, and fun. So proud of our team!
Jun 3
Today, we’re launching Reve 2.0, the best 4K image model in the world. We invented a new way to generate and edit any image using precise layouts. For the first time, it’s possible to create images you can touch.
7
35
2,981
Brought to you by the codex text-to-cad skill, thank you @earthtojake.
1
4
612
Noam Hurwitz retweeted
Replying to @AI_EmeraldApple
This is why we need American-made computers and phones.
253
321
9,352
130,620
Noam Hurwitz retweeted
touch grass
May 23
Name one thing that you can do better than CLAUDE
1
1
11
630
Noam Hurwitz retweeted
Logged on RuneScape for the first time in a while. Map looks a little different.
107
1,292
23,056
870,642
Welcome to Alfred’s Cafe, we have jolly good service.
5
2
20
2,015
Alfred just sold a $3 strawberry in 45 seconds. That equates to a >$2m run rate 🤯. Ping @earthtojake if you're a VC.
3
165