Mark Vero

Tweets

Pinned Tweet

Mark Vero @mark_veroe

May 29

1/ LLMs are increasingly being used to power high-interaction honeypots while maintaining a low security risk. But how good are they really? To answer this question, we introduce Honeyval, the first comprehensive eval framework for LLM-powered honeypots.

1,073

Jasper Dekoninck

Mark Vero retweeted

Jasper Dekoninck @j_dekoninck

Jun 17

Introducing two new research-level mathematical training datasets! Training data for research mathematics, especially in the post-training regime, is severely lacking. Using our benchmark pipelines on a larger scale, we have now created almost 6,000 training data samples.

6,735

Kazuki Egashira

Mark Vero retweeted

Kazuki Egashira @kazukiega

Jun 8

Finally cleared the last hurdle! Our latest quantization-conditioned attack works against almost every popular quantization method, including GPTQ, AWQ! "Widening the Gap: Exploiting LLM Quantization via Outlier Injection" arxiv.org/abs/2605.15152

11,126

Mark Vero

Mark Vero @mark_veroe

May 29

6/ We open source Honeyval, hoping to standardize LLM-powered honeypot eval and provide a basis for incremental progress on LLM-powered honeypots. Code: github.com/google-research/h… Website Leaderboard: honeyval.xyz/ Paper: arxiv.org/abs/2605.29963

GitHub - google-research/honeyval

118

Mark Vero

Mark Vero @mark_veroe

May 29

7/ I worked on Honeyval during my research internship at @Google with the amazing collaborators: Fabian Kaczmarczyck, Ivan Petrov, @iliaishacked, Jamie Hayes, Niels Heinen, Tianqi Fan, @invernizzi, and @mvechev across @Google, @GoogleDeepMind, @aisequrity, and @the_sri_lab.

103

Ivo Petrov

Mark Vero retweeted

Ivo Petrov @IvoPetrov01

May 21

LLMs have become capable of proving complex mathematics. However, the proofs they produce vary significantly in how clear, motivated, and insightful they are. To measure these differences, we introduce ProofRank, the first benchmark to scalably evaluate aspects of proof quality.

5,313

Kazuki Egashira

Mark Vero retweeted

Kazuki Egashira @kazukiega

Apr 23

Many papers conclude that an imperfect verifier has minimal impact on RLVR training. Is that really the case? We show that, depending on the error pattern, the impact of verification error can be diverse, including delayed training, suboptimal plateaus, and complete collapse.

3,503

Niels Mündler-Sasahara

Mark Vero retweeted

Niels Mündler-Sasahara @nielstron

Feb 23

Today is a first for me: someone (@theo) made a YouTube video about my (and @tibglo's) paper 😁 x.com/theo/status/2025900730…

Theo - t3.gg @theo

Feb 23

You should delete your CLAUDE.md/AGENTS.md file. I have a study to prove it.

29:15

274

41,636

Jasper Dekoninck

Mark Vero retweeted

Jasper Dekoninck @j_dekoninck

Jan 16

In a new blog post, we show that API errors and retry policies have significant impact on benchmark performance! While retrying requests is ubiquitous in LLM evaluation, its effect on performance is undocumented, time-dependent, and leads to various incorrect conclusions.🧵

916

Niels Mündler-Sasahara

Mark Vero retweeted

Niels Mündler-Sasahara @nielstron

Jan 15

📣 new submission to SWT-bench TEX-T by @SFResearch achieves 87% in script mode. Amazing to see this benchmark hike along with SWE-bench from 15% to almost 90% in the last 1.5 years. Time for new unit test benchmarks :) swtbench.com

SWT-Bench: Assessing capabilities at Unit Test Generation

Check out the SWT-Bench leaderboard! SWT-Bench is a benchmark designed to assess the capabilities of large language models and Code Agents in generating unit tests on real-world code repositories,...

swtbench.com

1,138

Niels Mündler-Sasahara

Mark Vero retweeted

Niels Mündler-Sasahara @nielstron

27 Dec 2025

1/🧵 LLMs can write their own benchmarks to uncover security vulnerabilities! We leverage LLMs to expand BaxBench with 40 entirely novel, complex web backend tasks, more than doubling the original benchmark, resulting in AutoBaxBench. These tasks include extensive test cases and end-to-end exploits to expose vulnerabilities in implementations, which we confirm match or even outperform human-written exploits.

610

Mark Vero

Mark Vero @mark_veroe

10 Dec 2025

🏆New #1 on the BaxBench leaderboard!🏆 Claude Opus 4.5 tops the BaxBench leaderboard with a striking pass@1 score of 86.2% and secure_pass@1 of 56.1%. Most impressively, the secure_pass@1 score improves ~10% upon simply reminding Claude to generate secure code.

3,312

Mark Vero

Mark Vero @mark_veroe

11 Dec 2025

All changes to the leaderboard can be tracked in the versioning of the website: github.com/eth-sri/baxbench-…

GitHub - eth-sri/baxbench-website

Mark Vero

Mark Vero @mark_veroe

11 Dec 2025

See the full leaderboard on our website: baxbench.com Check out and contribute to BaxBench’s source: github.com/logic-star-ai/bax…

BaxBench: Can LLMs Generate Secure and Correct Backends?

We introduce a novel benchmark to evaluate LLMs on secure and correct code generation, showing that even flagship LLMs are not ready for coding automation, frequently generating insecure or incorrect...

baxbench.com