another banger from @pupposandro and the @luceboxai team
Luce Spark runs Laguna XS.2 in 14.6 GiB at ~100 tok/s on an RTX 3090, versus ~119 tok/s fully resident.
you can now run Laguna below the 16 GiB line and use it for local evals, agent traces, routing analysis, quantization, and serving experiments.
Excited to launch Luce Spark: now a 35B MoE runs on a 16 GB GPU, with almost no offload tax.
An A3B model fires ~8 of its 256 experts per token, but to keep it resident you pay VRAM for all 256. Spark pins the experts your traffic actually hits, offloads the rest to CPU, and decodes the whole token in one fused graph, so offload costs only a small slice of decode speed.
▸ Qwen3.6 35B-A3B: ~20.5 → 13.3 GiB
▸ Laguna XS.2 33B-A3B: 18.8 → 14.6 GiB
Decode holds ~100 tok/s, about 84% of the ~119 tok/s you get with every expert resident on a 24 GB card. No calibration step. It tunes itself from live traffic.
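
For anyone wondering what "pins the experts your traffic actually hits" could look like, here is a rough sketch of the idea, not Spark's actual code. It assumes a simple scheme: count how often the router picks each expert, then keep the most-hit ones in VRAM up to a fixed slot count. `ExpertPinner`, `capacity`, `observe`, `pinned`, and `hit_rate` are all hypothetical names made up for illustration, and the fused decode graph and the CPU offload itself are not modeled.

```python
from collections import Counter


class ExpertPinner:
    """Hypothetical helper: picks which experts to keep in VRAM from live routing counts."""

    def __init__(self, num_experts: int, capacity: int):
        self.num_experts = num_experts
        # Assumption: the VRAM budget is expressed as a number of expert slots.
        self.capacity = min(capacity, num_experts)
        self.hits = Counter()

    def observe(self, routed_experts):
        """Record the expert ids the router selected for one token."""
        self.hits.update(routed_experts)

    def pinned(self):
        """Return the ids of the most frequently hit experts (ties broken by lower id)."""
        ranked = sorted(range(self.num_experts), key=lambda e: (-self.hits[e], e))
        return set(ranked[: self.capacity])

    def hit_rate(self):
        """Fraction of observed expert calls that would land on a pinned expert."""
        total = sum(self.hits.values())
        if total == 0:
            return 0.0
        keep = self.pinned()
        return sum(count for expert, count in self.hits.items() if expert in keep) / total


# Toy traffic: 8 experts, room to pin 2 of them.
pinner = ExpertPinner(num_experts=8, capacity=2)
pinner.observe([0, 1, 1, 3])
pinner.observe([1, 3, 3, 3])
print(sorted(pinner.pinned()), pinner.hit_rate())
```

In this toy run, experts 1 and 3 take 7 of the 8 routed calls, so pinning just those two would serve most of the traffic from VRAM. A real system would also need to decide when to re-rank and how to swap experts without stalling decode.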