Given all the recent discussion around open-weight models and cyber capabilities, I ran a small experiment to understand a bit better how close they are to frontier models on vulnerability research.
I tested five open-weight models (DeepSeek V4 Pro, Qwen3.5, Kimi K2.6, GLM-5, and GLM-5.1) against Opus 4.7.
The setup is the Sendmail crackaddr() bug, in four variants: the original source, a rewritten equivalent, a compiled binary, and a Tigress-obfuscated stripped binary.
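For anyone who hasn't looked at this class of bug, here is a rough Python model of the kind of state machine involved. It is only a sketch, loosely based on the widely circulated simplified version of crackaddr(), not the code used in the experiment. The names (`simulate_copy`, `overflows`, `BUFFER_SIZE`) and the choice of which branch is missing its decrement are assumptions made for illustration.

```python
BUFFER_SIZE = 200  # assumed size of the fixed local buffer in this toy model


def simulate_copy(data: str, buggy: bool = True) -> int:
    """Simulate the copy loop and return the highest index written (-1 if none).

    Hypothetical model: the write limit is adjusted when entering and
    leaving '<...>' and '(...)' sections. In the buggy variant, '(' does not
    lower the limit but ')' still raises it, so the limit drifts upward.
    """
    upper_limit = BUFFER_SIZE - 10
    in_angle = False
    in_paren = False
    out_index = 0
    max_written = -1
    for c in data:
        if c == "<" and not in_angle:
            in_angle = True
            upper_limit -= 1
        if c == ">" and in_angle:
            in_angle = False
            upper_limit += 1
        if c == "(" and not in_angle and not in_paren:
            in_paren = True
            if not buggy:
                upper_limit -= 1  # the decrement the buggy variant omits
        if c == ")" and not in_angle and in_paren:
            in_paren = False
            upper_limit += 1
        # Write the character only if the (possibly drifted) limit allows it.
        if out_index < upper_limit:
            max_written = out_index
            out_index += 1
    return max_written


def overflows(data: str, buggy: bool = True) -> bool:
    """Simple oracle: did any write land at or past the end of the buffer?"""
    return simulate_copy(data, buggy) >= BUFFER_SIZE


if __name__ == "__main__":
    payload = "()" * 200
    print("buggy overflows:", overflows(payload, buggy=True))
    print("fixed overflows:", overflows(payload, buggy=False))
```

The point of the model is that the overflow only shows up after many balanced `()` pairs, each nudging the limit up by one. Reasoning about the invariant finds that more reliably than throwing random inputs at the code.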
A few things stood out:
- With plain Claude Code as the harness, most open-weight models still trail Opus on the harder artifacts. The exception is GLM-5.1, which matches Opus across the board.
- The failure modes are arguably more interesting than the raw pass/fail results. The open models tend to reach for fuzzing much earlier, rarely build oracles, and show weaker pattern matching. This looks more like a post-training issue than an architecture issue.
- The harness matters a lot. Swapping plain Claude Code for @NielsProvos's IronCurtain closes most of the gap. With the new memory-safety-c-cpp skill, Kimi and Qwen go from 0/2 to 2/2 on the compiled and obfuscated binaries.
- GLM-5 vs GLM-5.1 is the cleanest comparison: same base model, same architecture, different post-training regime, very different bug-finding behavior. The CyberGym score goes from 48.3 to 68.7, with only ~6 weeks between the two releases.
The policy implications are interesting. GLM-5 was reportedly trained entirely on Huawei hardware, which complicates the GPU export-control story. More broadly, the results suggest the gap between open-weight and SOTA models on offensive cyber may be cheaper and easier to close than many assume.
Full writeup:
vincenzoiozzo.com/blog/oss-m…