Strong Model Alone = 70
Strong Model Good Harness = 85
Strong Model Good Harness Expert = 100
As far as we can tell, no. There is only anecdotal evidence, along with claims from AI pentesting vendors.
If a strong model can do everything by itself, then what exactly have these vendors been building? It is understandable that people would prefer a story in which the harness, workflow, and surrounding infra matter a great deal. It's also why people keep flexing "0-days" in OpenSSL, FFmpeg, or nginx, despite limited real-world impact.
That said, Niels Provos was not trying to sell anything, and he and several people have reported good results with IronCurtain despite using relatively weak models.
Most importantly, what Google achieved with Chrome suggests that a good harness may be quite valuable. Google does not appear to have access to anything more capable than Mythos, which means they likely scanned Chrome using Mythos itself or something less powerful. Yet they still uncovered hundreds of bugs.
There is, however, another explanation. Google may simply have better Chrome/V8 experts who can extract more value from Mythos. This remains our preferred hypothesis.
What provides a real advantage: domain knowledge accumulated over many years, or a harness vibe-coded in an afternoon? We think the answer is fairly obvious.