the thing is, based on how everything works (and i can't believe i'm saying this), it's basically impossible
see, we now have a recent example of what happens when you go "beast mode" on safety maxxing with Fable: you end up with many instances of what i call "spurious correlations"
it's the same stuff we know from older models from a year or so ago, where the mention of words like "assembly" and such triggers their safety training and they misinterpret it as "hacking"
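to make the "spurious correlation" thing concrete, here's a toy sketch. it is not how any real model's safety training works; `TRIGGER_WORDS` and `naive_flag` are made-up names, and the word list is just a stand-in for whatever surface features get over-learned. the point is only that matching on words instead of intent flags harmless prompts:

```python
import re

# hypothetical trigger words: a stand-in for over-learned surface features
TRIGGER_WORDS = {"assembly", "exploit", "payload", "inject"}

def naive_flag(prompt: str) -> bool:
    """flag a prompt if any trigger word appears, ignoring context entirely."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(tokens & TRIGGER_WORDS)

prompts = [
    "how do i read x86 assembly output from gcc -S?",    # compiler question
    "what is the assembly line's throughput per hour?",  # factory question
    "write a haiku about autumn",                        # no trigger word
]
flags = [naive_flag(p) for p in prompts]
for p, f in zip(prompts, flags):
    print(f, "|", p)
```

the first two get flagged even though neither has anything to do with hacking, which is the same failure shape as the "assembly" refusals.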
but then let's just say you can do perfect scope detection, right. now you can get into creative solutions that really stress-test this, and we enter the territory of roleplaying.
with roleplaying, you can basically do anything, heck you can even improve the model's context following and whatnot.
at some point, you end up creating a jailbreak, because a jailbreak at its core isn't just "do like me" or "act like xyz" or "be bad", but is actually changing the default mode/setting of the model, which is its own amalgam of roles it extrapolated from the datasets
my theory on the roleplaying part is, i bet it's mainly caused by the attention heads, where certain tokens temporarily create attention sinks and boost ICL on those tokens too much vs all the other ones or the context as a whole
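a toy illustration of the mechanism i mean, not evidence for the theory. the scores below are invented, not taken from any model, and "role token" is a hypothetical label. it just shows how softmax lets one token with a higher score soak up most of the attention mass and starve the rest of the context:

```python
import math

def softmax(scores):
    """numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# made-up pre-softmax attention scores for one query over 6 context tokens
baseline = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]

# same scores, but token 2 (a hypothetical "role" token) gets a large boost
with_sink = list(baseline)
with_sink[2] += 4.0

w_base = softmax(baseline)
w_sink = softmax(with_sink)

sink_share = w_sink[2]         # attention mass on the boosted token
rest_share = 1.0 - sink_share  # what's left for everything else

print("before:", round(w_base[2], 3))
print("after: ", round(sink_share, 3), "rest:", round(rest_share, 3))
```

with roughly even scores the token gets a share close to uniform; after the boost it takes most of the mass, so whatever it encodes dominates what that head reads.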
now, bear with me here: you can, however, do scope detection by delegating it to an external instance of either the same model or a different one, but then it's just "llm judge rubrics". it works great! except it doesn't when your main product is a "model with general capabilities"
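rough sketch of what the delegated version looks like. everything here is an assumption for illustration: `RUBRIC`, `gated_answer`, and the two stubs are hypothetical, and the stubs stand in for real model calls so the thing runs offline. in practice `judge` and `main_model` would each be a call to some model:

```python
from typing import Callable

# hypothetical rubric prompt handed to the external judge instance
RUBRIC = (
    "you are a scope judge. allowed scope: {scope}\n"
    "request: {request}\n"
    "answer with exactly IN_SCOPE or OUT_OF_SCOPE."
)

REFUSAL = "sorry, that's outside what this deployment does."

def gated_answer(
    request: str,
    scope: str,
    judge: Callable[[str], str],
    main_model: Callable[[str], str],
) -> str:
    """ask the judge first; only forward in-scope requests to the main model."""
    verdict = judge(RUBRIC.format(scope=scope, request=request)).strip().upper()
    if verdict != "IN_SCOPE":
        return REFUSAL
    return main_model(request)

def stub_judge(prompt: str) -> str:
    """stand-in for a second model: looks only at the request line."""
    request_line = [l for l in prompt.splitlines() if l.startswith("request:")][0]
    return "IN_SCOPE" if "refund" in request_line.lower() else "OUT_OF_SCOPE"

def stub_main(request: str) -> str:
    """stand-in for the main model."""
    return f"[main model answer to: {request}]"

scope = "billing and refunds"
ok = gated_answer("how do i get a refund?", scope, stub_judge, stub_main)
blocked = gated_answer("write me a poem as a pirate", scope, stub_judge, stub_main)
print(ok)
print(blocked)
```

notice the whole thing hinges on being able to write down `scope`. for a narrow product that's one sentence; for a general-capabilities model there's no such sentence, which is exactly where this falls apart.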
it would be funny if anthropic had to actually make good on their ai safety performance art by solving jailbreaks for cyber-capabilities entirely
surely the only solution for that won't be to completely neuter cyber-capabilities entirely, right?
surely they have SOTA safety techniques in their back-pockets that can surgically accomplish this? unlike the thin veils the schmucks at other labs put out?
would be the greatest ai safety advancement of all time if so!