Joined November 2023
Can't wait for something like Claude Code for Ableton 👀, the days of manually EQing and editing MIDI should be over
github.com/dropbox/gemlite/t… You can now easily load various pre-quantized models (block FP8, NVFP4, AWQ/GPTQ/HQQ, certain GGUFs) via a vLLM plugin! You can also run on-the-fly quantization. Easy to use: 1 or 2 flags to enable!

There's a very simple one-shot trick that seems to improve the quality of low-bit weight quantization by quite a bit in some cases: simply reordering the rows. It doesn't require changing the matmul kernel, only reshuffling the activations.
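A minimal sketch of why reordering is free at matmul time, assuming the weight is stored as K x N with quantization groups running along the K (row) axis. The sort-by-magnitude criterion, the toy `fake_quant` helper, and the hand-picked values are all illustrative assumptions, not the method from the post; the only point is that permuting weight rows together with the matching activation columns leaves the product unchanged while changing which rows share a scale.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (M x K) and b (K x N)."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*b)] for row in a]

def fake_quant(w, group_size=2, bits=4):
    """Symmetric round-to-nearest with one scale per (row group, column)."""
    qmax = 2 ** (bits - 1) - 1
    out = [row[:] for row in w]
    for start in range(0, len(w), group_size):
        rows = range(start, min(start + group_size, len(w)))
        for j in range(len(w[0])):
            scale = max(abs(w[i][j]) for i in rows) / qmax or 1.0
            for i in rows:
                out[i][j] = round(w[i][j] / scale) * scale
    return out

def sq_error(a, b):
    """Sum of squared element-wise differences."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hand-picked toy weights: large and small rows alternate, so each
# group of 2 rows mixes very different magnitudes.
W = [[7.0, -6.0], [0.5, 0.4], [6.0, 7.0], [0.4, -0.5]]
X = [[1.0, 2.0, 3.0, 4.0], [-1.0, 0.5, 2.0, -3.0]]

# Assumed criterion (hypothetical): sort rows by their largest magnitude.
perm = sorted(range(len(W)), key=lambda i: max(abs(v) for v in W[i]))
W_perm = [W[i] for i in perm]                   # reorder weight rows once, offline
X_perm = [[row[i] for i in perm] for row in X]  # reshuffle activation columns

y_ref = matmul(X, W)
y_perm = matmul(X_perm, W_perm)                 # same product, same kernel

err_plain = sq_error(W, fake_quant(W))
err_perm = sq_error(W_perm, fake_quant(W_perm))
print(err_plain, err_perm)
```

In this toy case the reordered weights quantize with lower error because similar-magnitude rows end up sharing a scale. Whether a given ordering helps on real weights depends on the weights, which matches the "in some cases" caveat.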
Skimming through the PolarQuant paper by Google and found HQQ is still alive 👀
Babe, wake up, new GemLite update. Up to 1.7x faster FP8 block quantization on the RTX PRO 6000 end-2-end in vLLM!
What kind of FP4 format does the new TPU8 use? MXFP4 quality is pretty poor, and NVFP4 is specific to Nvidia, so I'm guessing it uses a smaller group size (<32) to achieve better quality? 🤔
Some great on-device multi-vector work at Dropbox, check it out!
Open sourcing something fun from @Dropbox: Witchcraft. It's a local search engine built in Rust with no API keys or vector DB required. Think: ColBERT / late interaction style retrieval, but packaged to run locally (perfect for coding agents). Let's dive in👇
Running Bonsai-1.7B 1-bit model at 660 tokens/sec decoding speed with GemLite on the RTX 5090 🫡
After a massive refactoring to improve sm_120 perf, vLLM with GemLite now outperforms vLLM's NVFP4, especially for decoding! Will be available in the next release soon 🫡
Little tip to make MXFP8 faster: you don't really need the activation block scales. Instead, you can simply use identity scales in the mma op and apply channel-wise post-scaling after the accumulation loop. No big accuracy loss 🫡
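A rough sketch of the arithmetic behind this tip, under several assumptions: "channel-wise" is read here as one scale per activation row, integer rounding stands in for the real FP8 cast, the block size is 2 instead of 32, and `quantize_rows` and `gemm_post_scaled` are hypothetical helper names. It only shows that a per-row activation scale factors out of the accumulation loop, so it can be applied once at the end while the weight block scales stay inside the loop.

```python
def quantize_rows(x, qmax=448.0):
    """One scale per activation row; rounding is a stand-in for the FP8 cast."""
    scales = [max(abs(v) for v in row) / qmax or 1.0 for row in x]
    xq = [[float(round(v / s)) for v in row] for row, s in zip(x, scales)]
    return xq, scales

def gemm_post_scaled(xq, x_scales, wq, w_scales, block=2):
    """Accumulate with identity activation scales, then post-scale each row."""
    k_dim, n_dim = len(wq), len(wq[0])
    out = []
    for m, row in enumerate(xq):
        out_row = []
        for n in range(n_dim):
            acc = 0.0
            for k0 in range(0, k_dim, block):
                ks = range(k0, min(k0 + block, k_dim))
                partial = sum(row[k] * wq[k][n] for k in ks)
                # Activation block scale is the identity (1.0); only the
                # weight block scale is applied inside the loop.
                acc += partial * 1.0 * w_scales[k0 // block][n]
            out_row.append(acc * x_scales[m])  # post-scaling after the loop
        out.append(out_row)
    return out

X = [[0.5, -1.0, 2.0, 4.0], [10.0, 20.0, -5.0, 1.0]]
Wq = [[1.0, -2.0], [3.0, 4.0], [-1.0, 2.0], [5.0, -3.0]]  # quantized weights
W_scales = [[0.5, 0.25], [2.0, 1.0]]                      # one per (K block, column)

Xq, X_scales = quantize_rows(X)
y = gemm_post_scaled(Xq, X_scales, Wq, W_scales)

# Reference: dequantize both operands first, then do a plain matmul.
X_deq = [[v * s for v in row] for row, s in zip(Xq, X_scales)]
W_deq = [[Wq[k][n] * W_scales[k // 2][n] for n in range(2)] for k in range(4)]
y_ref = [[sum(a * b for a, b in zip(row, col)) for col in zip(*W_deq)] for row in X_deq]
```

The two results agree up to floating-point rounding. The accuracy trade-off mentioned in the post comes from using one coarse scale per row instead of one per block, which this sketch does not measure.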
Friendly reminder that QuaRot had a full 4-bit LLM solution back in 2024: weights, activations and KV cache, all in INT4, running on Ampere with int4 mma and an actual massive speed-up, before 4-bit KV cache was cool.
Triton ships ptxas 12.9 for Blackwell, but CUDA 13.0 ptxas adds support for e2m1x2.f16x2 which makes activation quant go brrr. However, it seems that ptxas 13.0 actually generates worse kernels, typically with large M 🤔
This bit flip might be the issue
Little trick to outperform Cutlass for NVFP4 on sm_120: use mixed TMA. Because TMA requires padding to 128, I don't use it for the activation scales, which gives a huge bump in decoding speed 🫡!
🫡
🫡
Our next kernel competition is now open for submissions! A $1.1M cash prize competition sponsored by AMD on optimizing DeepSeek-R1-0528 and GPT-OSS-120B on MI355X. Registration: luma.com/cqq4mojz
Finally squeezing in some time to revisit GemLite 🫡
Simple trick to boost performance on sm_120 with vLLM: switch to Flashinfer or Triton attention. The default backend is about 15% slower end-2-end 🫡 Still waiting for mxfp8 attention to make sm_120 go brrr👀