Meta × TBD Lab × CMU × UChicago × UMaryland
In our latest work, we introduce
Token-Level LLM Collaboration via FusionRoute
📝: arxiv.org/pdf/2601.05106
LLMs have come a long way, but we continue to face the same trade-off:
– one huge model that kind of does everything, but is expensive and inefficient, or
– many small specialist models that are cheap, but brittle outside their comfort zones
We’ve tried a lot of things in between — model merging, MoE, sequence-level agents, token-level routing, controlled decoding, etc.
Each helps a bit, but all come with real limitations.
A key realization behind FusionRoute is:
Pure token-level model selection is fundamentally limited, unless you assume unrealistically strong global coverage.
We show this formally. And then we fix it by letting the same router also generate.
Concretely, FusionRoute is a lightweight router LLM that
– performs token-level model selection, and
– directly contributes complementary logits to refine or correct the selected specialist when it fails
So it's not "routing another model" — the router itself is part of the decoding policy as well.
This turns token-level collaboration from a brittle "pick-an-expert" problem into a strictly more expressive policy.
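To make the idea concrete, here is a toy sketch of a single decoding step. It is not the paper's implementation: the function names, the hand-picked numbers, and the choice to combine the router's contribution by plain logit addition are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


def select_only_step(routing_scores: Sequence[float],
                     expert_logits: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Pure token-level selection: pick one expert and decode from it alone."""
    chosen = argmax(routing_scores)
    return chosen, argmax(expert_logits[chosen])


def fused_step(routing_scores: Sequence[float],
               expert_logits: Sequence[Sequence[float]],
               router_logits: Sequence[float]) -> Tuple[int, int, List[float]]:
    """Selection plus a complementary router contribution.

    Assumption: the router's logits are simply added to the selected
    expert's logits. The actual combination rule may differ.
    """
    chosen = argmax(routing_scores)
    fused = [e + r for e, r in zip(expert_logits[chosen], router_logits)]
    probs = softmax(fused)
    return chosen, argmax(probs), probs


# Hypothetical toy setup: 2 specialists, vocabulary of 3 tokens.
routing_scores = [0.2, 0.8]              # router prefers specialist 1
expert_logits = [[2.0, 0.0, 0.0],        # specialist 0
                 [0.0, 1.0, 0.5]]        # specialist 1
router_logits = [0.0, -1.0, 1.0]         # router's own complementary logits

baseline_expert, baseline_token = select_only_step(routing_scores, expert_logits)
chosen, token, probs = fused_step(routing_scores, expert_logits, router_logits)

print(baseline_expert, baseline_token)   # selection only
print(chosen, token)                     # selection + router logits
```

In this toy case both variants pick the same specialist, but only the fused step can move the output away from that specialist's top token. With pure selection, the output is always whatever the chosen specialist prefers.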
No joint training of specialized models.
No model merging.
No full multi-agent rollouts.
In our experiments, FusionRoute works across math, coding, and instruction following, and it consistently outperforms sequence-level collaboration, prior token-level methods, model merging, and even direct fine-tuning.
This feels especially timely as LLM systems (e.g., GPT-5) move toward routing-based, heterogeneous model stacks (whether at the prompt level or at test time).