A score of over 78% on OSWorld-Verified collapsing to 41.2% when GUI and CLI have to be interleaved in one trajectory tells you what frontier models are missing.
Benchmarks that test one modality at a time measure each capability in isolation, which is something almost no real task requires on its own.
WeaveBench
Microsoft Research Asia introduces 114 long-horizon tasks that force agents to interleave GUI and CLI in one trajectory. The same frontier models that score over 78% on OSWorld-Verified collapse to 41.2% on WeaveBench.