github.com/dropbox/gemlite/t…
You can now easily load various pre-quantized models (block FP8, NVFP4, AWQ/GPTQ/HQQ, certain GGUFs) via a vLLM plugin! You can also run on-the-fly quantization, and it's easy to use: 1 or 2 flags to enable!
There's a very simple one-shot trick that seems to improve the quality of low-bit weight quantization by quite a bit in some cases: simply reordering the rows.
It doesn't require changing the matmul kernel, only reshuffling the activations.
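A minimal NumPy sketch of why the kernel can stay untouched, under a few assumptions that are not from the post: the weight `W` is stored as (K, N) so that rows index the input dimension, quantization is symmetric round-to-nearest with one scale per group of rows, and rows are ordered by their max-abs value (a hypothetical criterion, the post does not say how the order is chosen). The helper `quantize_groupwise` is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N, group = 4, 64, 8, 16

x = rng.standard_normal((M, K))
# Give the rows of W very different magnitudes so that grouping matters.
W = rng.standard_normal((K, N)) * rng.uniform(0.05, 4.0, size=(K, 1))

def quantize_groupwise(W, group, bits=4):
    """Symmetric round-to-nearest, one scale per (group of rows, column)."""
    K, N = W.shape
    qmax = 2 ** (bits - 1) - 1
    Wg = W.reshape(K // group, group, N)
    scale = np.abs(Wg).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    Wq = np.clip(np.round(Wg / scale), -qmax, qmax)
    return (Wq * scale).reshape(K, N)

# Hypothetical ordering: sort rows by max-abs so each group holds similar rows.
perm = np.argsort(np.abs(W).max(axis=1))

W_perm = W[perm, :]   # reorder the weight rows once, offline
x_perm = x[:, perm]   # reshuffle the activation columns the same way

y_ref = x @ W
y_perm = x_perm @ W_perm  # same product, only the summation order changed

err_plain = np.linalg.norm(x @ quantize_groupwise(W, group) - y_ref)
err_perm = np.linalg.norm(x_perm @ quantize_groupwise(W_perm, group) - y_ref)
print(f"output error, original order:  {err_plain:.4f}")
print(f"output error, reordered rows:  {err_perm:.4f}")
```

The permutation cancels out in the dot product, so the matmul itself is unchanged; only the assignment of rows to quantization groups differs.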
What kind of FP4 format does the new TPU8 use? MXFP4 quality is pretty poor, and NVFP4 is specific to Nvidia, so I'm guessing it uses a smaller group size (<32) to achieve better quality? 🤔
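To illustrate why group size matters for FP4, here is a rough NumPy simulation of E2M1 rounding with absmax scaling at several group sizes. It is a simplification: the scales are kept in full precision, whereas MXFP4 stores an E8M0 scale per 32 elements and NVFP4 an E4M3 scale per 16 elements, and the input is plain Gaussian noise rather than real weights. The function name `fake_quant_fp4` is hypothetical.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x, group_size):
    """Round each group to the E2M1 grid using an unquantized absmax scale."""
    g = x.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / E2M1[-1]
    scale = np.where(scale == 0, 1.0, scale)
    mag = np.abs(g) / scale
    # Nearest grid point for every element.
    idx = np.abs(mag[..., None] - E2M1).argmin(axis=-1)
    return (np.sign(g) * E2M1[idx] * scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 16)

mse = {gs: float(np.mean((fake_quant_fp4(w, gs) - w) ** 2)) for gs in (32, 16, 8)}
for gs, err in mse.items():
    print(f"group size {gs:2d}: MSE = {err:.5f}")
```

Smaller groups track the local dynamic range more closely, at the cost of storing more scales per weight.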
Open sourcing something fun from @Dropbox: Witchcraft.
It's a local search engine built in Rust with no API keys or vector DB required.
Think: ColBERT / late interaction style retrieval, but packaged to run locally (perfect for coding agents).
Let's dive in👇
After a massive refactoring to improve sm_120 perf, vLLM with GemLite now outperforms vLLM's NVFP4, especially at decoding!
Will be available in the next release soon 🫡
Little tip to make MXFP8 faster: you don't really need the activation block scales. Instead, you can simply use identity scales in the mma op and apply channel-wise post-scaling after the accumulation loop. No big accuracy loss🫡
Friendly reminder that QuaRot had a full 4-bit LLM solution back in 2024: weights, activations and KV cache, all in INT4, running on Ampere with int4 mma and an actual massive speed-up, before 4-bit KV cache was cool.
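A small NumPy sketch of the post-scaling idea, under stated assumptions: "channel-wise" is read here as one scale per activation row, the FP8 cast is imitated by a crude E4M3-like rounding that ignores subnormals and saturation, and the weights stay in full precision (in a real MXFP8 kernel they would keep their own block scales). The helper `round_e4m3_like` is made up for illustration.

```python
import numpy as np

FP8_MAX = 448.0  # largest finite E4M3 value

def round_e4m3_like(v):
    """Keep 4 significant bits (1 implicit + 3 mantissa); no subnormals."""
    m, e = np.frexp(v)  # v = m * 2**e with 0.5 <= |m| < 1
    return np.ldexp(np.round(m * 16) / 16, e)

rng = np.random.default_rng(0)
M, K, N = 4, 64, 8
x = rng.standard_normal((M, K))
w = rng.standard_normal((K, N))

# One scale per activation row instead of one scale per block of 32.
row_scale = np.abs(x).max(axis=1, keepdims=True) / FP8_MAX
x_q = round_e4m3_like(x / row_scale)

# The accumulation loop sees identity activation scales ...
acc = x_q @ w
# ... and the row scales are applied once, after the loop.
y_post = acc * row_scale

# Reference: apply the scales before the matmul instead.
y_pre = (x_q * row_scale) @ w

rel_err = np.linalg.norm(y_post - x @ w) / np.linalg.norm(x @ w)
print(f"relative error vs. full precision: {rel_err:.4f}")
```

Because a per-row scale is constant along K, it factors out of the dot product, which is why it can move out of the inner loop.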
Triton ships ptxas 12.9 for Blackwell, but CUDA 13.0 ptxas adds support for e2m1x2.f16x2 which makes activation quant go brrr.
However, it seems that ptxas 13.0 actually generates worse kernels, typically with large M 🤔
Little trick to outperform Cutlass for NVFP4 on sm_120: use mixed TMA. Because TMA requires padding to 128, I don't use it for the activation scales, which gives a huge bump in decoding speed🫡!
Our next kernel competition is now open for submissions! A $1.1M cash prize competition sponsored by AMD on optimizing DeepSeek-R1-0528 and GPT-OSS-120B on MI355X
Registration: luma.com/cqq4mojz
Simple trick to boost performance on sm_120 with vLLM: switch to Flashinfer or Triton attention.
The default backend is about 15% slower end-2-end 🫡
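One way to try this, for vLLM versions that read the `VLLM_ATTENTION_BACKEND` environment variable (an assumption about your install: newer releases may expose the setting differently, and the name of the Triton backend value has changed between versions, so check the docs for yours):

```python
import os

# Assumption: this vLLM version reads VLLM_ATTENTION_BACKEND at startup.
# "FLASHINFER" selects the FlashInfer backend in versions that support it.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# Import and start vLLM only after the variable is set, for example:
# from vllm import LLM
# llm = LLM(model="<your model>")  # placeholder, not a real model name

backend = os.environ["VLLM_ATTENTION_BACKEND"]
print(f"attention backend requested: {backend}")
```

Measure end-to-end throughput before and after on your own workload; the gap will depend on the model and batch sizes.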
Still waiting for mxfp8 attention to make sm_120 go brrr👀