High-throughput chips for LLMs.

Joined July 2023
2 Photos and videos
Pinned Tweet
Our goal is to make the best chips physically possible for the large model needs of frontier labs. The MatX One chip delivers higher throughput than any announced product while also achieving the lowest latency. @reinerpope joined @BloombergTV this morning to discuss:
We’re building an LLM chip that delivers much higher throughput than any other chip while also achieving the lowest latency. We call it the MatX One. The MatX One chip is based on a splittable systolic array, which has the energy and area efficiency that large systolic arrays are famous for, while also getting high utilization on smaller matrices with flexible shapes. The chip combines the low latency of SRAM-first designs with the long-context support of HBM. These elements, plus a fresh take on numerics, deliver higher throughput on LLMs than any announced system, while simultaneously matching the latency of SRAM-first designs. Higher throughput and lower latency give you smarter and faster models for your subscription dollar. We’ve raised a $500M Series B to wrap up development and quickly scale manufacturing, with tapeout in under a year. The round was led by Jane Street, one of the most tech-savvy Wall Street firms, and Situational Awareness LP, whose founder @leopoldasch wrote the definitive memo on AGI. Participants include @sparkcapital, @danielgross and @natfriedman’s fund, @patrickc and @collision, @TriatomicCap, @HarpoonVentures, @karpathy, @dwarkesh_sp, and others. We’re also welcoming investors across the supply chain, including Marvell and Alchip. @MikeGunter_ and I started MatX because we felt that the best chip for LLMs should be designed from first principles with a deep understanding of what LLMs need and how they will evolve. We are willing to give up on small-model performance, low-volume workloads, and even ease of programming to deliver on such a chip. We’re now a 100-person team with people who think about everything from learning rate schedules, to Swing Modulo Scheduling, to guard/round/sticky bits, to blind-mated connections—all in the same building. If you’d like to help us architect, design, and deploy many generations of chips in large volume, consider joining us.
6
11
96
39,984
MatX retweeted
New blackboard lecture w @reinerpope How do chips actually work – starting with basic logic gates, and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do. 0:00:00 – Building a multiply-accumulate from logic gates 0:16:20 – Muxes and the cost of data movement 0:25:59 – How systolic arrays work 0:39:00 – Clock cycles and pipeline registers 0:51:40 – FPGAs vs ASICs 1:03:14 – Cache vs scratchpad 1:07:16 – Why CPU cores are much bigger than GPU cores 1:11:49 – Brains vs chips 1:15:22 – A GPU is just a bunch of tiny TPUs Look up Dwarkesh Podcast on YouTube/Spotify/etc to watch. Enjoy!
94
725
5,593
925,750
MatX retweeted
Did a very different format with @reinerpope – a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It’s a bit technical, but I encourage you to hang in there - it’s really worth it. There are less than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him. Recommend watching this one on YouTube so you can see the chalkboard. 0:00:00 – How batch size affects token cost and speed 0:31:59 – How MoE models are laid out across GPU racks 0:47:02 – How pipeline parallelism spreads model layers across racks 1:03:27 – Why Ilya said, “As we now know, pipelining is not wise.” 1:18:49 – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal 1:32:52 – Deducing long context memory costs from API pricing 2:03:52 – Convergent evolution between neural nets and cryptography
150
598
6,563
1,281,003
MatX retweeted
Intelligence per picojoule, with @itsclivetime and @dylan522p (0:00) Intro (1:22) What is codesign? (2:49) Codesign example: Swish vs ReLU (4:22) Are DeepSeek papers codesign? (6:45) Predicting where ML research will go (8:06) Should researchers hate your chips? (9:34) Can you codesign too much? (13:23) Picking the right grain size for specialization (16:22) How much hardware flexibility for The Age of Research? (20:05) Did reasoning and RL disrupt hardware roadmaps? (23:09) Cerebras/Groq: unexpected wins on reasoning and RL (25:34) Disaggregating MLP and attention (29:06) The right metrics for quantization and codesign papers
11
57
624
145,782
MatX retweeted
I chatted with @ysmulki about MatX, chip design and where silicon designed for LLMs is headed (8:17) Tightly coupling SRAM and HBM on one chip (14:03) More MoE FLOPS, smaller KV cache load (16:08) Numerics: from 32-bit to 4-bit (19:02) Targeting both training and inference (22:14) Chip timelines (27:15) Logic and memory scarcity (29:42) Compute costs (32:07) Latency: from 20ms to 1ms as the new table stakes (40:50) Programming the chip (43:00) Starting MatX (47:11) Codesign without seeing the models (51:57) Interconnect design (55:44) Performance modeling philosophy (1:07:02) Prefill vs. decode (1:13:47) What's next
15
45
326
69,631
MatX retweeted
Reiner Pope (@MatXComputing) just raised a $500m round led by @leopoldasch and Jane Street to build faster AI chips. I enjoyed having him on Cheeky Pint so I could ask all my questions about how chip design actually works, where the speed-up comes from, and how the industry will evolve. 00:00:15 Google’s AI revival 00:07:54 MatX 00:17:11 AI supply chain 00:21:48 Designing chips 00:37:11 TSMC 00:44:17 Token pricing 00:44:55 RL-ing chip design 00:49:26 Design to production 00:56:05 MatX culture 01:02:57 Rust 01:05:21 Cuckoo hashing 01:09:35 Unexplored model architectures
21
41
434
53,588
Our goal is to make the best chips physically possible for the large model needs of frontier labs. The MatX One chip delivers higher throughput than any announced product while also achieving the lowest latency. @reinerpope joined @BloombergTV this morning to discuss:
We’re building an LLM chip that delivers much higher throughput than any other chip while also achieving the lowest latency. We call it the MatX One. The MatX One chip is based on a splittable systolic array, which has the energy and area efficiency that large systolic arrays are famous for, while also getting high utilization on smaller matrices with flexible shapes. The chip combines the low latency of SRAM-first designs with the long-context support of HBM. These elements, plus a fresh take on numerics, deliver higher throughput on LLMs than any announced system, while simultaneously matching the latency of SRAM-first designs. Higher throughput and lower latency give you smarter and faster models for your subscription dollar. We’ve raised a $500M Series B to wrap up development and quickly scale manufacturing, with tapeout in under a year. The round was led by Jane Street, one of the most tech-savvy Wall Street firms, and Situational Awareness LP, whose founder @leopoldasch wrote the definitive memo on AGI. Participants include @sparkcapital, @danielgross and @natfriedman’s fund, @patrickc and @collision, @TriatomicCap, @HarpoonVentures, @karpathy, @dwarkesh_sp, and others. We’re also welcoming investors across the supply chain, including Marvell and Alchip. @MikeGunter_ and I started MatX because we felt that the best chip for LLMs should be designed from first principles with a deep understanding of what LLMs need and how they will evolve. We are willing to give up on small-model performance, low-volume workloads, and even ease of programming to deliver on such a chip. We’re now a 100-person team with people who think about everything from learning rate schedules, to Swing Modulo Scheduling, to guard/round/sticky bits, to blind-mated connections—all in the same building. If you’d like to help us architect, design, and deploy many generations of chips in large volume, consider joining us.
6
11
96
39,984
MatX retweeted
Put your KVs in HBM! Contexts are growing ~infinitely long, and HBM is the best place to fit them. Definitely won't be underutilized: *everyone* is running out of HBM capacity.
4
1
17
2,209
MatX retweeted
We’re building an LLM chip that delivers much higher throughput than any other chip while also achieving the lowest latency. We call it the MatX One. The MatX One chip is based on a splittable systolic array, which has the energy and area efficiency that large systolic arrays are famous for, while also getting high utilization on smaller matrices with flexible shapes. The chip combines the low latency of SRAM-first designs with the long-context support of HBM. These elements, plus a fresh take on numerics, deliver higher throughput on LLMs than any announced system, while simultaneously matching the latency of SRAM-first designs. Higher throughput and lower latency give you smarter and faster models for your subscription dollar. We’ve raised a $500M Series B to wrap up development and quickly scale manufacturing, with tapeout in under a year. The round was led by Jane Street, one of the most tech-savvy Wall Street firms, and Situational Awareness LP, whose founder @leopoldasch wrote the definitive memo on AGI. Participants include @sparkcapital, @danielgross and @natfriedman’s fund, @patrickc and @collision, @TriatomicCap, @HarpoonVentures, @karpathy, @dwarkesh_sp, and others. We’re also welcoming investors across the supply chain, including Marvell and Alchip. @MikeGunter_ and I started MatX because we felt that the best chip for LLMs should be designed from first principles with a deep understanding of what LLMs need and how they will evolve. We are willing to give up on small-model performance, low-volume workloads, and even ease of programming to deliver on such a chip. We’re now a 100-person team with people who think about everything from learning rate schedules, to Swing Modulo Scheduling, to guard/round/sticky bits, to blind-mated connections—all in the same building. If you’d like to help us architect, design, and deploy many generations of chips in large volume, consider joining us.
121
201
2,288
3,038,435
MatX retweeted
We trained models with MXFP4-quantized attention, but it turns out this can break causal modeling. Our latest post explains why this happens and how to fix it. matx.com/research/leaky_quan…
1
17
103
32,165
MatX retweeted
I'll be in Toronto and Waterloo over the next week, I'd love to chat and tell you a bit more about what we're doing at MatX (and say hi); please feel free to reach out!
4
5
65
6,113
MatX retweeted
Excited to say I joined @MatXComputing late last year! The team is exceptionally thoughtful and the problems are both difficult and fun: from µarch, compilers, and models, to the systems we are building.
26
5
159
8,669
MatX retweeted
11 Mar 2025
MatX hardware will maximize intelligence per dollar for the world’s largest models. We are a team of 50 and growing quickly. If you are passionate about building the best chips for LLMs, consider joining us. matx.com/jobs
2
23
5,834
MatX retweeted
11 Mar 2025
MatX is designing chips and systems to 10x the computing power for the world’s largest AI workloads. Today, we are pleased to announce the closing of a >$100M Series A funding round led by @sparkcapital, with participation from @JaneStreetGroup, @danielgross and @natfriedman, @TriatomicCap, @HarpoonVentures, and @adamdangelo. In two years, we proved out all our technical bets across ML numerics, chip design and implementation, software, and system design—and secured all the necessary partnerships—to develop our chip. With this round of investment, we are now sufficiently funded to bring our systems to market.
17
20
185
35,662
MatX retweeted
28 Jan 2025
1. Breakdown of DeepSeek V3 efficiency vs Llama 3: - Better: 11x fewer FLOPs per token, thanks to MoE [37B vs 405B activated params] - Better: 2x faster numerics [fp8 vs bf16 training] - Worse: 0.5x flops utilization [16% vs 33% end-to-end MFU*] - Neutral: similar hardware platform [H800 and H100 both have 2Pflops/s dense fp8] - Neutral: same training data volume [14.8T vs 15T tokens] Llama 3’s design was obviously and intentionally conservative: dense model (not MoE), bf16 training (not fp8), GQA attention (not cheaper alternatives). DeepSeek benefited by being aggressive on all these fronts, at the cost of being later to market. 2. The core algorithmic improvements were already known; the closed source LLM labs were probably already doing similar things. DeepSeek’s improvements are real, but far more modest than the Llama comparison would suggest; my wild guess is closer to 1.5x improvement. MoE was published in 2017; in 2021 Switch Transformer reported 7x speedups vs dense models, similar to DeepSeek’s 11x. OpenAI is widely rumored to have been using MoE models for years. NVIDIA published their fp8 training paper in 2022. 3. NVIDIA’s stock price is down 15% after DeepSeek. Should it be? LLM compute is like a gas: it expands to fill the available budget. Over the last 3 years the labs have grown their budgets, despite algorithms and hardware improving. There’s no reason to expect this to change now: you win by making the best model, not by shrinking your budget. The more meaningful question: do algorithmic improvements like DeepSeek’s mean that margins will shift from hardware vendors to labs? Hard to see why. Algorithmic improvements are quickly copied from one lab to another, making it hard for them to maintain technological differentiation. Hardware improvements take much longer to copy.
7
40
383
68,335
MatX retweeted
NEW ODD LOTS: Two Veteran Chip Designers Have A Plan To Take On Nvidia @tracyalloway and I talked to @reinerpope and @MikeGunter_, both formerly of Alphabet, about their new company MatX that's aiming to build the ultimate semiconductor just for LLMs bloomberg.com/news/articles/…
6
14
85
87,463
MatX retweeted
I really enjoyed talking about the process and business of semiconductor design with @tracyalloway and @TheStalwart on the Odd Lots podcast. Joe and Tracy were wonderful hosts: They put me at ease and guided the conversation with the lightest of touch. We talked about what doing semiconductor design is like, why LLMs are hungry for as many FLOPS/$ as they can get, how @MatXComputing can provide that, and how NVIDIA's moat might be bridged. I particularly liked that @reinerpope got to communicate some of the sense of beauty that I also feel about good design. Helping lead MatX's engineering team (and meeting with our customers) is a humbling honor: It's talking with people who are the world experts in what we're chatting about. Being on Odd Lots was talking with grandmasters at conversation.
NEW ODD LOTS: Two Veteran Chip Designers Have A Plan To Take On Nvidia @tracyalloway and I talked to @reinerpope and @MikeGunter_, both formerly of Alphabet, about their new company MatX that's aiming to build the ultimate semiconductor just for LLMs bloomberg.com/news/articles/…
3
6
33
27,235
MatX retweeted
MatX will be at MLSys. Come join us at our After Hours in Santa Clara to talk about chips, compilers, partitioning, and optimizing ML models for future hardware. Many of us will be there, including me and @mikegunter_. Tuesday May 14th at 4pm, see matx.com/meetmatx.

2
21
3,358
MatX retweeted
We’re releasing seqax, a research-focused LLM codebase that is simple, explicit, and performs well on up to ~100 GPUs/TPUs. Everything you need to edit, from the math, to parallelism, to memory footprint, is all there in 500 lines of JAX code. 🧵 github.com/MatX-inc/seqax
8
48
277
43,563