A super cool project, not only for future LLM agent benchmarking, but also for SAST in general. Looking forward to seeing what gets built on top of this!
Excited to announce my preprint "eyeballvul: a future-proof benchmark for vulnerability detection in the wild". I create a benchmark of over 24,000 vulnerabilities to evaluate the vulnerability detection capabilities of long-context models on entire codebases, then evaluate 7 leading long-context models on it.