mobicham

Tweets

mobicham @mobicham

May 24

Can't wait for something like Claude Code for Ableton 👀. The days of manually EQing and editing MIDI should be over.

mobicham @mobicham

May 15

github.com/dropbox/gemlite/t… You can now easily load various pre-quantized models (block FP8, NVFP4, AWQ/GPTQ/HQQ, certain GGUFs) via a vLLM plugin! On-the-fly quantization is supported too, and it's easy to use: 1 or 2 flags to enable!

mobicham @mobicham

May 13

There's a very simple one-shot trick that seems to improve the quality of low-bit weight quantization by quite a bit in some cases: simply reordering the rows. It doesn't require changing the matmul kernel, only reshuffling the activations.
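A minimal NumPy sketch of why reordering is free at inference time. Several things here are assumptions, not details from the tweet: the weight layout (`K x N`, so rows index input features), the ordering heuristic (sort rows by absolute maximum), and the `fake_quant` helper, which is a hypothetical group-wise absmax quantizer rather than GemLite's. Whether the reordered error is actually lower depends on the weights, so the script only prints both numbers.

```python
import numpy as np

def fake_quant(w, group_size=16, bits=4):
    """Symmetric absmax quantize/dequantize.
    Each group is `group_size` consecutive rows of one column."""
    k, n = w.shape
    g = w.reshape(k // group_size, group_size, n)
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(g).max(axis=1, keepdims=True), 1e-12) / qmax
    q = np.clip(np.round(g / scale), -qmax, qmax)
    return (q * scale).reshape(k, n)

rng = np.random.default_rng(0)
k, n = 256, 64
# Toy weights whose rows have different magnitudes (an assumption for the demo).
w = rng.standard_normal((k, n)) * rng.uniform(0.1, 1.0, size=(k, 1))
x = rng.standard_normal((8, k))

# Hypothetical ordering heuristic: put rows of similar magnitude in the same group.
perm = np.argsort(np.abs(w).max(axis=1))
w_perm = w[perm]       # reorder weight rows once, offline
x_perm = x[:, perm]    # reshuffle activation columns the same way at runtime

y_ref = x @ w
y_perm = x_perm @ w_perm  # same product, only the summation order over K changed

err_plain = np.mean((x @ fake_quant(w) - y_ref) ** 2)
err_perm = np.mean((x_perm @ fake_quant(w_perm) - y_ref) ** 2)
print(f"output MSE, original order:  {err_plain:.3e}")
print(f"output MSE, reordered rows: {err_perm:.3e}")
```

The matmul kernel itself is untouched: the permutation is applied to the weights before quantization and to the activations before the call.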

mobicham @mobicham

May 12

Skimming through the PolarQuant paper by Google and found HQQ is still alive 👀

mobicham @mobicham

Apr 24

Babe, wake up, new GemLite update. Up to 1.7x faster FP8 block quantization on the RTX PRO 6000 end-2-end in vLLM!

mobicham @mobicham

Apr 23

What kind of FP4 format does the new TPU8 use? MXFP4 quality is pretty poor, and NVFP4 is specific to Nvidia, so I'm guessing it uses a smaller group size (<32) to achieve better quality? 🤔
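For context, MXFP4 shares one power-of-two scale across 32 elements, while NVFP4 uses 16-element groups with an FP8 scale. The sketch below is a rough simulation of how group size affects FP4 (E2M1) rounding error on Gaussian data. It is an illustration under simplifying assumptions: `fp4_fake_quant` is a hypothetical helper, non-power-of-two scales are kept in full precision instead of being rounded to FP8, and the power-of-two scale is rounded up so nothing clips. It says nothing about what any TPU actually does.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1, and the full signed grid (ascending).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[:0:-1], E2M1])

def fp4_fake_quant(x, group_size, pow2_scale=False):
    """Quantize/dequantize a 1-D array to E2M1 with one absmax scale per group."""
    g = x.reshape(-1, group_size)
    scale = np.maximum(np.abs(g).max(axis=1, keepdims=True), 1e-12) / 6.0
    if pow2_scale:
        # MX-style power-of-two scale, rounded up to avoid clipping (simplification).
        scale = 2.0 ** np.ceil(np.log2(scale))
    # Round every scaled value to the nearest grid point.
    idx = np.abs((g / scale)[..., None] - GRID).argmin(axis=-1)
    return (GRID[idx] * scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 16)

mse = {gs: np.mean((fp4_fake_quant(x, gs) - x) ** 2) for gs in (32, 16, 8)}
mse_mx_like = np.mean((fp4_fake_quant(x, 32, pow2_scale=True) - x) ** 2)

print(f"group 32, power-of-two scale: {mse_mx_like:.3e}")
for gs, err in mse.items():
    print(f"group {gs:2d}, full-precision scale: {err:.3e}")
```

Smaller groups cost more scale storage per weight, so the trade-off is bits per element versus rounding error.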

mobicham @mobicham

Apr 17

Some great on-device multi-vector work at Dropbox, check it out!

Josh Clemm

@joshclemm

Apr 16

Open sourcing something fun from @Dropbox: Witchcraft. It's a local search engine built in Rust with no API keys or vector DB required. Think: ColBERT / late interaction style retrieval, but packaged to run locally (perfect for coding agents). Let's dive in👇

mobicham @mobicham

Apr 4

Running Bonsai-1.7B 1-bit model at 660 tokens/sec decoding speed with GemLite on the RTX 5090 🫡

mobicham @mobicham

Apr 2

After a massive refactoring to improve sm_120 perf, vLLM with GemLite now outperforms vLLM's NVFP4, especially at decoding! Will be available in the next release soon 🫡

mobicham @mobicham

Apr 2

Already available in master: github.com/dropbox/gemlite

GitHub - dropbox/gemlite: Fast low-bit matmul kernels in Triton

mobicham @mobicham

Mar 31

Little tip to make MXFP8 faster: you don't actually need the activation block scales. Instead, you can simply use identity scales in the mma op and apply channel-wise post-scaling after the accumulation loop. No big accuracy loss 🫡
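A NumPy sketch of the algebra behind this tip. Everything named here is a stand-in: `scaled_mma` only mimics a block-scaled mma, `round_e4m3` is a rough FP8 E4M3 emulation (3 mantissa bits, no subnormals), and "channel-wise" is assumed to mean one scale per activation row. Because that scale is constant along K, it can be factored out of the accumulation loop and applied once at the end.

```python
import numpy as np

BLOCK = 32  # MX block size along K

def round_e4m3(x):
    """Rough FP8 E4M3 emulation: 3 mantissa bits, clip to +-448, subnormals ignored."""
    m, e = np.frexp(np.clip(x, -448.0, 448.0))
    return np.ldexp(np.round(m * 16) / 16, e)

def scaled_mma(acc, a_blk, a_scale, w_blk, w_scale):
    """Stand-in for a block-scaled mma: acc += (a * a_scale) @ (w * w_scale)."""
    return acc + (a_blk * a_scale) @ (w_blk * w_scale)

rng = np.random.default_rng(0)
M, K, N = 4, 128, 8
a = rng.standard_normal((M, K))
w = rng.standard_normal((K, N))

# Weights keep their per-block power-of-two scales, as in MXFP8.
wb = w.reshape(K // BLOCK, BLOCK, N)
w_scale = 2.0 ** np.ceil(np.log2(np.abs(wb).max(axis=1, keepdims=True) / 448.0))
w_q = round_e4m3(wb / w_scale).reshape(K, N)

# Activations: one scale per row instead of one per 32-element block (assumption).
s_a = np.abs(a).max(axis=1, keepdims=True) / 448.0
a_q = round_e4m3(a / s_a)

ident = np.ones((M, 1))
acc = np.zeros((M, N))
acc_inloop = np.zeros((M, N))
for b in range(K // BLOCK):
    ks = slice(b * BLOCK, (b + 1) * BLOCK)
    acc = scaled_mma(acc, a_q[:, ks], ident, w_q[ks], w_scale[b])             # identity scales
    acc_inloop = scaled_mma(acc_inloop, a_q[:, ks], s_a, w_q[ks], w_scale[b]) # scale in the loop

y = acc * s_a          # channel-wise post-scaling after the accumulation loop
y_inloop = acc_inloop  # reference: same scale applied inside every mma
y_ref = a @ w
rel_err = np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)
print(f"relative error vs. full precision: {rel_err:.3f}")
```

The two accumulators agree up to floating-point rounding; the accuracy cost comes only from giving each activation row a single scale instead of one per block.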

mobicham @mobicham

Mar 30

Friendly reminder that QuaRot had a full 4-bit LLM solution back in 2024: weights, activations and KV cache, all in INT4, running on Ampere with int4 mma and an actual massive speed-up, before 4-bit KV cache was cool.

mobicham @mobicham

Mar 11

Triton ships ptxas 12.9 for Blackwell, but CUDA 13.0 ptxas adds support for e2m1x2.f16x2 which makes activation quant go brrr. However, it seems that ptxas 13.0 actually generates worse kernels, typically with large M 🤔

mobicham @mobicham

Mar 12

This bit flip might be the issue

mobicham @mobicham

Mar 11

Little trick to outperform Cutlass for NVFP4 on sm_120: use mixed TMA. Because TMA requires padding to 128, I don't use it for the activation scales, which gives a huge bump in decoding speed 🫡!

mobicham @mobicham

Mar 7

🫡

mobicham @mobicham

Mar 6

🫡

GPU MODE

@GPU_MODE

Mar 6

Our next kernel competition is now open for submissions! A $1.1M cash prize competition sponsored by AMD on optimizing DeepSeek-R1-0528, GPT-OSS-120B on MI355X. Registration: luma.com/cqq4mojz

mobicham @mobicham

Mar 3

Finally squeezing in some time to revisit GemLite 🫡

mobicham @mobicham

Feb 19

Simple trick to boost performance on sm_120 with vLLM: switch to FlashInfer or Triton attention. The default backend is about 15% slower end-2-end 🫡 Still waiting for mxfp8 attention to make sm_120 go brrr👀

mobicham @mobicham

Feb 16

New blogpost 🫡! dropbox.tech/machine-learnin…

How low-bit inference enables efficient AI

Making products like Dropbox Dash accessible to individuals and businesses means tackling new challenges around efficiency and resource use.
