mlx-vlm v0.6.3 is here 🚀
Day-0 support for TWO new models from our partners we work closely with:
🔥
@GoogleDeepMind DiffusionGemma — a genuinely new architecture. Instead of token-by-token, it generates 256-token blocks in parallel with bi-directional attention and iteratively self-corrects the whole block, image-generator style. 26B MoE, only 3.8B active, fits in 18GB quantized. Day-0 MLX support via our Google DeepMind partnership, with long-context prefill tuned and ready.
🔥
@cohere's North Mini Code 1.0 — a 30B MoE with just 3B active, running ~66 tok/s in BF16 before any compression. Day-0 on MLX thanks to our close collaboration with the Cohere team.
Get started today — install from source:
> uv pip install -U mlx-vlm
Then serve the model and point your favorite agent at it (pi, opencode, hermes, etc.):
uv run mlx_vlm.server --model MODEL-REPO
Model collection 👇🏽
Meet DiffusionGemma!
An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.
Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇