OS hacker GPU optimization

Joined October 2009
Haohui Mai retweeted
.@BernieSanders , it is a time to celebrate. @elonmusk has created enormous value for society by building @SpaceX, driving down the cost of rocket launches and creating a global satellite communication network that has brought high speed, low-cost internet and communication access to hundreds of millions and eventually billions of people along with critical advantages for our military and our nation’s defense. SpaceX and its technologies will cause an acceleration in the growth of wages and wealth creation globally, including in some of the poorest communities in the U.S. and around the world. Access to low-cost, high speed communications everywhere will allow children around the world to be educated, families to build businesses, and life-saving medical knowledge and care to be available everywhere. SpaceX will materially bring down the cost of compute, advancing AI and humanity. Meanwhile, 4,000 SpaceX employees yesterday became millionaires, including hourly wage employees who you claim you are trying to help. The Elon Musks of the world drive growth, global GDP, and provide access to goods and services at lower cost that would otherwise not exist. Elon’s nominal trillionaire status is due to his ownership of SpaceX, Tesla, Neuralink, the Boring Company and his other initiatives that have brought new technologies that improve our everyday lives. Elon is not sitting on a trillion dollar pile of cash, jewelry and gold. He is using his controlling stakes in his companies to advance mankind. Elon’s companies don’t pay dividends. They reinvest all of their capital to accelerate innovation and value creation. Elon is working 24/7 for all of us. He deserves respect and appreciation, not smears. Bernie, your socialism would never allow a SpaceX to be built. Socialism has only proven to impoverish mankind and lead to death and destruction. 
We need to create the conditions for more SpaceXs to be built, not attack the great entrepreneurs who are helping to advance our country.
Haohui Mai retweeted
RT appreciated. Anyone looking for an excellent Linux kernel developer? Ruowen (@chinqrw) is one of the best. He is on the market due to the shutdown of Red Hat China. He's mainly looking in China, but is also open to jobs elsewhere. He co-leads the Rex project (github.com/rex-rs/rex) with @Jinghao_J, which they started at UIUC. He also has extensive experience working on Red Hat's kernel-QE. I worked with Ruowen as my TA for CS 423 and on the Rex project. He is great!
I am currently seeking opportunities in Linux kernel development, eBPF, Rust-for-Linux, or related platform work. DM or chinqrw@gmail.com — happy to share the resume. #LinuxKernel #eBPF #Rust #RustForLinux #OpenToWork
LLMs can write GPU kernels, but they still struggle to make them assembly-fast. Real-world performance requires complex, tightly coupled optimizations across the whole kernel. ARGUS is the first agentic framework to achieve assembly-fast performance on real-world GPU kernels. On AMD MI300X, it reaches 99–104% of hand-optimized assembly throughput on GEMM, FlashAttention, and fused MoE, while running 2–1543× faster than existing agentic systems. ARGUS makes these global properties explicit through data-flow invariants. These invariants specify what should match at key program points, such as ensuring tensor core instructions see consistent matrix operands despite changes to swizzled memory layouts, tiling, and pipelining. That gives both the compiler and the LLM dense guidance beyond sparse unit tests, verified at compile time with abstract interpretation and SMT solving. arxiv.org/abs/2604.18616
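To make the idea of a data-flow invariant concrete, here is a minimal Python sketch. It is not ARGUS's method or API: the XOR swizzle, the function names, and the 8x8 tile are all assumptions chosen for illustration, and the check is brute force rather than abstract interpretation or SMT solving. It only shows the kind of property meant by "operands stay consistent despite a swizzled layout": whatever layout the store uses, the load must recover the same logical tile.

```python
def swizzle(row: int, col: int, width: int = 8) -> int:
    """Hypothetical XOR swizzle: permutes the column index within a row.

    Assumes width is a power of two so the result stays in [0, width).
    """
    return col ^ (row % width)


def store_tile(tile):
    """Write a logical tile into a simulated shared-memory buffer, swizzled."""
    rows, cols = len(tile), len(tile[0])
    mem = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            mem[r][swizzle(r, c, cols)] = tile[r][c]
    return mem


def load_tile(mem):
    """Read the buffer back using the same swizzle as the store."""
    rows, cols = len(mem), len(mem[0])
    return [[mem[r][swizzle(r, c, cols)] for c in range(cols)] for r in range(rows)]


def invariant_holds(tile) -> bool:
    """Invariant: the consumer sees exactly the operand the producer wrote."""
    return load_tile(store_tile(tile)) == tile


tile = [[r * 8 + c for c in range(8)] for r in range(8)]
print(invariant_holds(tile))
```

If the load side were changed to a different layout than the store side, the same check would fail, which is the class of bug such invariants are meant to surface early.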
It turns out that the only reliable network connection on a plane is UDP. TCP over UDP saves the day!
Haohui Mai retweeted
I wrote a post-mortem article on how glitches in an AI paper-writing assistant tool in the last 30 minutes caused my group to miss an SOSP deadline for a paper that we had worked on for more than a year. open.substack.com/pub/yiying…

Haohui Mai retweeted
We want to speak directly to the concern many of you have expressed, and we owe you a clear explanation of what happened, why it happened, and where we stand now. We understand this situation caused genuine alarm and we take that seriously. In preparing the NeurIPS 2026 handbook, we included a link to a US government sanctions tool that covers a significantly broader set of restrictions than those NeurIPS is actually required to follow. This error was due to miscommunication between the NeurIPS Foundation and our legal team; there was never an intention to restrict participation beyond our mandatory compliance obligations. The responsibility for that error is ours as an organization, and we deeply apologize for the alarm and impact this miscommunication had on our community. We have updated the link and clarified the text of our policy, which is consistent with that of ACM and IEEE, as well as other international conferences and NeurIPS in the past. As in previous years, NeurIPS welcomes submissions from all compliant institutions and individuals. We want to reiterate that NeurIPS is a community-driven event, created by and for the community, and strives to be inclusive. The NeurIPS 2026 organizing committee was particularly saddened to learn of this institutional miscommunication. The organizing committee has taken on the responsibility of running the conference this year with the goal of fostering open communication, knowledge sharing, and global scientific discourse. We thank the community for bringing this issue to our attention and working with us through this situation.
Haohui Mai retweeted
if you’re a CS/EE student, write your thesis on JIT compilation of eBPF for NVMe controllers. there’s huge career alpha in computational storage; the standards are *just* starting to exist (TP4091)
Haohui Mai retweeted
HEARTBREAKING: Ex-PhD student Brendt Christensen found GUILTY of posing as cop, luring, abducting, R*ping & d*capitating Chinese scholar Yingying Zhang in his apartment in 2017. Her dism*mbered remains STILL missing. Never forget Yingying’s story.
Haohui Mai retweeted
The paper is now available: huggingface.co/papers/2602.0… More updates coming soon!
Holiday cooking finally ready to serve! 🥳 Introducing DFlash: speculative decoding with block diffusion.
🚀 6.2× lossless speedup on Qwen3-8B
⚡ 2.5× faster than EAGLE-3
Diffusion vs AR doesn’t have to be a fight. At today’s stage:
• dLLMs = fast, highly parallel, but lossy
• AR LLMs = accurate, sequential, but slow
DFlash = diffusion drafts, AR verifies.
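The "drafts, then verifies" split can be sketched with a toy example. This is generic greedy speculative decoding, not DFlash itself: the two toy models, the block size, and all function names are assumptions, the draft here is a plain function rather than a block diffusion model, and verification is done token by token where a real system would score the whole block in one parallel pass. What it does show is why the speedup can be lossless: every emitted token is the one the target model would have chosen anyway.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token prefix to the next token


def target_next(prefix: List[int]) -> int:
    """Toy 'accurate' model (hypothetical): deterministic next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[int]) -> int:
    """Toy draft model: agrees with the target except at every 4th position."""
    tok = target_next(prefix)
    return tok if len(prefix) % 4 else (tok + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n: int, block: int = 4) -> List[int]:
    """Draft proposes a block; target keeps the matching prefix and corrects
    the first mismatch, so the output equals the target's greedy output."""
    out = list(prompt)
    goal = len(prompt) + n
    while len(out) < goal:
        proposal: List[int] = []
        for _ in range(block):
            proposal.append(draft(out + proposal))
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # equals tok when the draft was right
            if expected != tok or len(out) == goal:
                break  # discard the rest of the block after a mismatch
    return out


prompt = [3, 1, 4]
print(speculative_decode(target_next, draft_next, prompt, 20)
      == greedy_decode(target_next, prompt, 20))
```

The better the draft agrees with the target, the more tokens are accepted per verification round, which is where the speedup comes from.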
Haohui Mai retweeted
Sounds incredible until you read the fine print. The compiler generates less efficient code than GCC with all optimizations disabled. It doesn’t have its own assembler or linker. It can’t produce a 16-bit x86 code generator. And Carlini himself says it has “nearly reached the limits of Opus’s abilities.” New features and bugfixes kept breaking existing functionality.
So what did $20,000 and two weeks actually buy? A compiler that passes 99% of GCC’s torture tests but can’t match the output quality of a tool that’s had 37 years of human engineering. That’s the constraint nobody’s pricing in.
The real story is in the cost curve, not the capability demo. $20,000 for 100,000 lines means $0.20 per line of generated code. A senior compiler engineer costs roughly $150/hour. At maybe 50 polished lines per hour for something this complex, that’s $3/line. AI just did it 15x cheaper, and it will only get cheaper from here.
But the code isn’t equivalent. The AI version needs a human to finish the assembler, fix the linker, optimize the output, and prevent regressions. Those are the hardest 20% of the problem, and they represent 80% of the engineering value. Anthropic built the demo. Shipping the product still requires humans.
This tells you exactly where we are in the autonomous software timeline. AI can now produce impressive first drafts of complex systems at trivial cost. Turning those drafts into production software still requires the judgment that costs $300K per year in compiler engineer salary. The gap between “compiles the Linux kernel” and “replaces GCC” is measured in decades of accumulated engineering wisdom that no model has internalized yet.
The companies that understand this will use agent teams to generate the 80% and hire engineers to finish the 20%. The companies that don’t will ship $20,000 compilers that produce slower code than a free tool from 1987.
New Engineering blog: We tasked Opus 4.6 using agent teams to build a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux kernel. Here's what it taught us about the future of autonomous software development. Read more: anthropic.com/engineering/bu…
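The cost comparison in the post above is simple enough to reproduce. The sketch below only restates the figures quoted in the post ($20,000, 100,000 lines, $150/hour, 50 polished lines per hour); the variable and function names are made up for illustration, and the 50 lines/hour figure is the post's own rough guess, not a measured number.

```python
def cost_per_line(total_cost_usd: float, lines: int) -> float:
    """Average cost of one line of code."""
    return total_cost_usd / lines


# Figures quoted in the post.
ai_cost_per_line = cost_per_line(20_000, 100_000)

engineer_rate_usd_per_hour = 150
polished_lines_per_hour = 50  # the post's estimate, an assumption
human_cost_per_line = engineer_rate_usd_per_hour / polished_lines_per_hour

ratio = human_cost_per_line / ai_cost_per_line

print(f"AI:    ${ai_cost_per_line:.2f} per line")
print(f"Human: ${human_cost_per_line:.2f} per line")
print(f"Ratio: {ratio:.0f}x")
```

The ratio is linear in the productivity guess: at 25 polished lines per hour it doubles, and at 100 it halves, so the headline multiple is only as good as that estimate.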
Haohui Mai retweeted
19 Dec 2025
Performance Hints Over the years, my colleague Sanjay Ghemawat and I have done a fair bit of diving into performance tuning of various pieces of code. We wrote an internal Performance Hints document a couple of years ago as a way of identifying some general principles and we've recently published a version of it externally. We'd love any feedback you might have! Read the full doc at: abseil.io/fast/hints.html
12 Nov 2025
I really hope that AMD can spend some love on their out-of-the-box experience for inference. For example, the latest sglang v0.5.5 docker image has been broken for a whole week due to github.com/sgl-project/sglan…. Maybe it's time to add some smoke tests.
Haohui Mai retweeted
🚀 End the GPU Cost Crisis Today!!! Tired of LLMs that lock a whole GPU but leave capacity idle? Frustrated by your cluster's low utilization? We launch kvcached, the first library for elastic GPU sharing across LLMs. 🔗 github.com/ovg-project/kvcac… 🧵👇 Why it matters:
Haohui Mai retweeted
22 Sep 2025
How do you run FP4 models on AMD MI250/MI300 without waiting for MI350? The CausalFlow team @wheat9 built Petit, optimized mixed-precision kernels co-designed with AMD’s MatrixCore. Benchmarks: 🔹 1.74× faster Llama-3.3-70B inference 🔹 3.7× faster GEMM vs hipBLASLt. Open-sourced and integrated into SGLang v0.4.10. See the full blog👇
Haohui Mai retweeted
“Greatness does not come out of intelligence, it comes from character. Character is not formed out of smart people: it is formed out of people who have suffered.” — Nvidia CEO, Jensen Huang
Haohui Mai retweeted
For everyone saying this is not a time for blame, this is exactly the moment that the people who have been responsible for the mismanagement of California's fire policies be held accountable. Our governor has been obsessed with holding a special session to "Trump-proof" CA, when he should be focused on FIRE-PROOFING our communities. Here are a few hard facts about the democrat-run state's failures to address fire prevention:
Haohui Mai retweeted
🚀 Making cross-engine LLM serving programmable. Introducing LLM Microserving: a new RISC-style approach to designing LLM serving APIs at the sub-request level. Scale LLM serving with programmable cross-engine serving patterns, all in a few lines of Python. blog.mlc.ai/2025/01/07/micro…
15 Dec 2024
A perfect example that getting an advanced degree does not fix your own bias or ignorance. For prospective students entering academia: you will see a good number of these people, and please don’t be one of them.
Thanks to @RosalindPicard for bringing up China’s massive problem with scientific fraud in her NeurIPS talk. More must be done to combat it, starting with changing the culture that fosters it.
Haohui Mai retweeted
I’m concerned to see Pat Gelsinger ousted as Intel CEO. He wasn’t a firebrand visionary, and it wasn’t exactly going great, but he was deeply technical, and I don’t expect his replacement to equal him there. “Business harder” isn’t going to return Intel to greatness, only technical achievement will.
Haohui Mai retweeted
5 Sep 2024
“Mean response time was 90ms.” “Servers have gotten crazy fast!” Web developers really live in their whole little universe. Meanwhile my single node MMO server processes 200k packets/second across 6 cores for 131k player slots on only 1Gbps. It must handle pathfinding, interest management, hiscores sorting, gameplay, database, etc. Stop wasting money on bloated tech stacks.
4 Sep 2024
Basecamp did 5,250 req/sec at peak yesterday. Mean response time was 90ms. So call that needing 500 cores at max load. If you skipped redundancy, you could probably do that with 3 boxes each running a Z5 192-core AMD chip with room to spare. Servers have gotten crazy fast!
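The "500 cores" figure in the Basecamp post follows from Little's law (average work in flight = arrival rate × mean time in system), under the loud assumption that each in-flight request occupies one core for its entire response time. The sketch below reproduces that estimate and, for comparison, derives per-packet budgets from the MMO server numbers in the reply. All function names are illustrative, and the packet-size bound assumes the 1 Gbps link is the only limit.

```python
import math


def cores_busy(requests_per_sec: float, mean_response_s: float) -> float:
    """Little's law: average number of requests in flight.

    Assumes one core is fully occupied per in-flight request.
    """
    return requests_per_sec * mean_response_s


def boxes_needed(cores: float, cores_per_box: int) -> int:
    """Whole machines needed to supply the given number of cores."""
    return math.ceil(cores / cores_per_box)


def per_packet_budget_us(packets_per_sec: float, cores: int) -> float:
    """CPU time available per packet, in microseconds, across all cores."""
    return cores / packets_per_sec * 1e6


def max_avg_packet_bytes(link_bits_per_sec: float, packets_per_sec: float) -> float:
    """Largest average packet size the link can carry at this packet rate."""
    return link_bits_per_sec / 8 / packets_per_sec


# Basecamp: 5,250 req/sec at 90 ms mean response time.
print(cores_busy(5_250, 0.090))   # in-flight requests, rounded up to ~500 in the post
print(boxes_needed(500, 192))     # 192-core boxes needed for 500 cores

# MMO server: 200k packets/sec on 6 cores over a 1 Gbps link.
print(per_packet_budget_us(200_000, 6))
print(max_avg_packet_bytes(1e9, 200_000))
```

The one-core-per-request assumption is pessimistic if requests spend most of their 90 ms waiting on a database or network, which is part of what the reply is arguing about.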