Loaded qwen3.6 27b on a single 3090, hooked it into Zed alongside the Claude CLI, and ran them head-to-head on a real task.
setup: I'd been doing some research on one of my apps with qwen3.6 27b through openclaw and wanted to redesign the app based on it. So I dropped that research in locally and asked both Sonnet 4.6 and qwen3.6 to read it alongside my current codebase and draft a redesign plan. Worth flagging that the research itself was generated by qwen3.6 via openclaw... I haven't gone through it line by line, so I can't fully vouch for its quality, but both models were reading the same source, so it's still a fair comparison.
Tool calls were identical. File lookups were identical. For a minute I genuinely thought "damn, it's actually competing."
Then I read the outputs. qwen3.6 drafted generic architecture and code-change recommendations: boilerplate refactor stuff. Sonnet actually engaged with the research and produced a redesign plan that matched the intent of the work.
On paper, qwen3.6 looks like it almost beats frontier models. In practice, the reasoning and understanding aren't there yet. Oh, and it took an extra 20 minutes stuck in thinking loops to get there... Benchmarks lie.
Code-gen comparison dropping next.
(video has no audio... accepting voiceover apps in the replies, credit guaranteed)