Final recommended llama.cpp config
```bash
./llama-server \
-m Qwen3.6-27B-Q4_K_M.gguf \
-ngl 99 --split-mode layer --tensor-split 58,42 \
-c 102400 \
-fa on -ctk q4_0 -ctv q4_0 \
--spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.75 \
-ub 256 \
-np 1 --jinja -t 8 \
--host 127.0.0.1 --port 8080
```
The flags that carry the findings:
- `-ub 256` β *the* result (5β6): halves the per-prefill compute-graph reserve (~1.8GB), frees ~1.4GB on the choking card. This, not KV-quant, unlocks deep ctx.
- `--tensor-split 58,42` β weight off GPU1 because MTP pins the draft head growing draft-KV there (7); whole-layer quantized, ~170MiB/step.
- `-ctk/-ctv q4_0` β sane default, but *not* the deep-ctx fix (5).
- `--spec-type draft-mtp --spec-draft-n-max 3` β native MTP, no separate draft model (2).
- `-c 102400` β 100k, the honest ceiling (8); config D ran 0β94k clean. (Config E's `-c 153600` / `-ts 54,46` was only to trace the throughput curve, not for serving.)
# Power cap is NOT a llama.cpp flag β set it first (250W = efficiency knee, ~96% perf / β17% power):
sudo nvidia-smi -pl 250 # both cards