Mark Vero

Martin Vechev retweeted

Mark Vero @mark_veroe

May 29

1/ LLMs are increasingly being used to power high-interaction honeypots while maintaining a low security risk. But how good are they really? To answer this question, we introduce Honeyval, the first comprehensive eval framework for LLM-powered honeypots.

1,073

Ivo Petrov

Martin Vechev retweeted

Ivo Petrov @IvoPetrov01

May 21

LLMs have become capable of proving complex mathematics. However, the proofs they produce vary significantly in how clear, motivated, and insightful they are. To measure these differences, we introduce ProofRank, the first benchmark to scalably evaluate aspects of proof quality.

5,313

Jasper Dekoninck

Martin Vechev retweeted

Jasper Dekoninck @j_dekoninck

Mar 13

How often do LLMs claim to prove false mathematical statements? In our latest benchmark, BrokenArXiv, we find they do so very often. The best model, GPT-5.4, only rejects 40% of incorrect statements obtained by perturbing recent ArXiv papers, and other models do much worse.

116

819

86,196

Martin Vechev

@mvechev

Feb 24

some of our work on (not) using CLAUDE.md/AGENTS.md has been widely profiled recently.

Overview - Claude Code Docs

Claude Code is an agentic coding tool that reads your codebase, edits files, runs commands, and integrates with your development tools. Available in your terminal, IDE, desktop app, and browser.

code.claude.com

Theo - t3.gg

@theo

Feb 23

You should delete your CLAUDE.md/AGENTS.md file. I have a study to prove it.

3,338

Niels Mündler-Sasahara

Martin Vechev retweeted

Niels Mündler-Sasahara

@nielstron

Feb 17

Are AGENTS.md files actually useful for coding agents? In our latest preprint, we analyze the effect of context files on coding model performance. Short TLDR: It depends. Manually-written files help, while LLM-generated files (e.g. by the agent) don't. More details in the thread 🧵

8,625

Jasper Dekoninck

Martin Vechev retweeted

Jasper Dekoninck @j_dekoninck

28 Nov 2025

How far can we push LLM performance on Project Euler (PE), a set of challenging mathematical programming problems? In a new blog, we explore this by designing specialized agents. With the help of the PE community, we also perform a qualitative analysis of model performance. 🧵

2,433

Martin Vechev

@mvechev

18 Nov 2025

Grateful that Google gave us access to Gemini 3 to evaluate its performance on MathArena (math performance) before the release, and these results are now included in their release and model card! Gemini 3 is topping all categories on MathArena, great work!

Nikola Jovanović @ni_jovanovic

18 Nov 2025

Full @GoogleAI Gemini 3 results are up on MathArena: ➡️ #1 on 2025 Final-Answer Competitions ➡️ #1 on Apex: 5.2% -> 23.4% new SOTA ➡️ #1 on Visual Math: 79% -> 84% new SOTA ➡️ #2 on Project Euler: 62%, huge jump compared to 2.5 Pro (15%)

950

INSAIT Institute

Martin Vechev retweeted

INSAIT Institute

@INSAITinstitute

21 Oct 2025

🎥 Can AI video models truly understand physics? The newly released Physics-IQ benchmark, developed by INSAIT and Google DeepMind and led by Saman Motamed, PhD student at INSAIT, has sparked major discussion across the AI community following its presentation at #ICCV2025.
🔬 The work marks a significant step forward in understanding the physical reasoning limits of today’s generative video models - paving the way for future AI systems that not only generate realistic videos but also reason about the physical world with accuracy and depth.
📊 Physics-IQ provides a comprehensive benchmark of 396 real-world videos, covering diverse physical scenarios - from fluid dynamics to solid mechanics, challenging AI models to predict future frames and interactions beyond surface-level visual cues.
🤔 The findings were eye-opening: even state-of-the-art models like #Sora, #Runway, and #VideoPoet create visually stunning clips but fail to capture true physical dynamics, revealing the gap between perception and understanding.
🚀 The project has been met with great interest from the research community, highlighting the importance of integrating experiential and interactive learning into next-generation video models.
📂 Explore the open-source dataset, evaluation code, and results - links in comments
#GenerativeAI #VideoAI #AIResearch #PhysicsInAI #PhysicalReasoning #AIUnderstanding #AIBenchmark #OpenSourceAI #FutureOfAI #AIInnovation #INSAIT

686

INSAIT Institute

Martin Vechev retweeted

INSAIT Institute

@INSAITinstitute

23 Oct 2025

🔥 We’re releasing SPEAR-1 (spear.insait.ai) - a new robotic AI foundation model that achieves state-of-the-art performance with 20× less robotic data 🧠 Why it matters: SPEAR-1 is like the ChatGPT for robots - a single model that can perform many tasks, on any robot, in any environment. 💡 What’s new: unlike others, SPEAR-1 learns from both robotic and non-robotic 3D data, breaking the data bottleneck that slows robotic AI. 🤖 Open-weight, general-purpose, and multilingual for robots - a major step toward scalable robot learning. #Robotics #FoundationModels #3DPerception #Manipulation #INSAIT #Europe #DataEfficiency

17,486

INSAIT Institute

Martin Vechev retweeted

INSAIT Institute

@INSAITinstitute

25 Oct 2025

🚀 Big news for INSAIT, Bulgaria & Europe! @WIRED (read by 30M people/month) just profiled SPEAR-1 — INSAIT’s new foundation robotic model, the first released by Europe! 🤖 As WIRED notes, SPEAR-1 matches leading global models trained on many times more data - a huge leap toward ChatGPT-like AI for robotics! 👏 Congrats to all involved: Nikolay Nikolov, @giualbanese1, @DSombit, Jan-Nico Zaech, Danda Pani Paudel, Luc Van Gool & Alex Yanev. #AI #Robotics #Europe #INSAIT

571

INSAIT Institute

Martin Vechev retweeted

INSAIT Institute

@INSAITinstitute

28 Oct 2025

🚀 Major news! @Google expands its support for INSAIT with a new $1,000,000 contribution - targeting groundbreaking AI research and expanding INSAIT’s local ecosystem initiatives. 💰 Google’s total support for INSAIT now well exceeds $6,000,000, further strengthening INSAIT’s capacity to conduct world-class research and cultivate the next generation of AI talent in Bulgaria. 🌍 This milestone builds on years of collaboration - from @GoogleDeepMind PhD fellowships to funding for training AI models - helping position INSAIT as a world-class AI research organization. Google has also twice profiled BgGPT, the first Bulgarian LLM, built by INSAIT, in a series of articles read by millions around the world. ✨ A huge thank you to Google for the continued trust and meaningful support to INSAIT! #INSAIT #Google #AI #BgGPT #Innovation #Research #DeepMind #Gemma #Bulgaria

709

INSAIT Institute

Martin Vechev retweeted

INSAIT Institute

@INSAITinstitute

22 Oct 2025

🚀 INSAIT makes an impact at @ICCVConference! With more than 10,000 participants, ICCV is the world’s leading conference in AI and computer vision - and INSAIT’s booth has become one of its most vibrant spots. Hundreds of researchers, industry leaders, and students stopped by to discover what INSAIT is all about and to connect with our team. 🤖 From cutting-edge robotics and embodied AI to next-generation foundation models, visitors experienced firsthand how INSAIT is pushing the boundaries of global AI research. 🇧🇬 We’re proud to represent Bulgaria on the world stage - proving that world-class deep tech innovation thrives right here in Sofia. #ICCV2025 #INSAIT #AI #ComputerVision #Research #DeepTech #Bulgaria #Robotics

827

Nikola Jovanović

Martin Vechev retweeted

Nikola Jovanović @ni_jovanovic

22 Oct 2025

MathArena most viewed on alphaXiv😲 Cool work @askalphaxiv (although Apex is a different dataset, you should title this USAMO and link to our USAMO eval instead)

alphaXiv

@askalphaxiv

21 Oct 2025

We used DeepSeek OCR to extract every dataset from tables/charts across 500k AI arXiv papers for $1000 🚀 See which benchmarks are trending and discover datasets you didn't know existed Doing the same task with Mistral OCR would've cost $7500 👀

939

Jasper Dekoninck

Martin Vechev retweeted

Jasper Dekoninck @j_dekoninck

20 Oct 2025

New competition on MathArena 🥳 This is a nice one, and I can highly recommend checking out some of the traces. Seeing GPT-5 write dozens of pages and still fail on a problem you can solve in <10sec is very satisfying.

Nikola Jovanović @ni_jovanovic

20 Oct 2025

MathArena goes visual: We evaluated models such as GPT-5 on Math Kangaroo 2025, a recent contest for ages 6-19 where most tasks require visual reasoning. Models struggle the most with tasks for younger kids. For example, they solve this task for 1st graders only 3% of the time 🧵

1,584

Nikola Jovanović

Martin Vechev retweeted

Nikola Jovanović @ni_jovanovic

20 Oct 2025

12,125

Kazuki Egashira

Martin Vechev retweeted

Kazuki Egashira @kazukiega

13 Oct 2025

🚨 Be careful when pruning an LLM! 🚨 Even when the model appears benign, it might start behaving maliciously (e.g., jailbroken) once you download and prune it. Here’s how our attack works 🧵 arxiv.org/abs/2510.07985

4,332

INSAIT Institute

Martin Vechev retweeted

INSAIT Institute

@INSAITinstitute

13 Oct 2025

🚀⚛️ Major result: we are announcing qblaze – a state-of-the-art quantum simulator, built by researchers at INSAIT and @ETH_en!
🥇 qblaze sets a record for the largest number factored to date with Shor’s algorithm by a quantum circuit simulator – a 39-bit number (549 755 813 701). In comparison, despite recent advances, the largest number factored to date with Shor’s algorithm on an actual quantum computer is 21.
📜 qblaze matches the previous record set with the specialized (for Shor’s algorithm) emulator shorgpu – except shorgpu used 2048 GPUs, while qblaze only uses 2 CPUs!
⚡ qblaze outperforms publicly available industry quantum simulators, including @IBM’s Qiskit Aer and @Microsoft’s Q#, and on Shor and Grover achieves a speed-up of over 2000x!
🧠 qblaze scales thanks to a novel sparse data structure and highly optimized parallel algorithms – the research paper describing qblaze’s operation was accepted at ACM OOPSLA’25, and will be presented this week in Singapore. OOPSLA is a top research conference in programming languages and systems.
💻 qblaze is fully open source, documented, has an easy-to-use Python API and can be used as a drop-in replacement for IBM’s Qiskit simulators as well as other quantum frameworks. All about qblaze can be found at qblaze.org.
👏 Congratulations to all qblaze authors: Hristo Venev (INSAIT), Dimitar Dimitrov (INSAIT), Timon Gehr (ETH Zurich), Martin Vechev (INSAIT, ETH Zurich) and Thien Udomsrirungruang (former INSAIT summer research fellow).

976

INSAIT Institute

Martin Vechev retweeted

INSAIT Institute

@INSAITinstitute

10 Oct 2025

🚀 We are excited to announce BrokenMath, the first benchmark designed to systematically evaluate sycophancy in theorem proving with large language models (LLMs), now live at sycophanticmath.ai!
🧩 We show that even the best LLMs can produce convincing but wrong proofs when given false statements by users - a behavior known as sycophancy. This poses a major challenge for deploying AI systems in math and science, where truthfulness and rigor are essential.
📘 BrokenMath introduces 504 expertly verified false theorems derived from 2025 national and international math competition problems, creating a realistic and challenging environment for studying model reliability and reasoning integrity.
📊 The results show that sycophancy is widespread, with even GPT-5 producing proofs for false statements 29% of the time. The issue worsens as problems become more difficult and when tasks involve proof-based reasoning. While mitigation strategies such as prompting, agentic reasoning, and fine-tuning provide partial relief, none fully resolve the issue.
🌐 Explore the benchmark, datasets, and paper - links in comments.
👩‍🔬 Congratulations to all authors: Ivo Petrov (INSAIT), Jasper Dekoninck (ETH Zürich), Martin Vechev (INSAIT, ETH Zürich)

798

INSAIT Institute

Martin Vechev retweeted

INSAIT Institute

@INSAITinstitute

6 Oct 2025

🤖 INSAIT had a strong presence at #CoRL2025 – the leading conference in AI for robotics – held this year in Seoul, which gathered more than 2500 participants! 🚀 In exciting news, INSAIT’s robotics team was one of two able to qualify for RoboArena - a challenge for evaluating robotics foundation AI models. INSAIT’s model was able to outperform state-of-the-art models such as Physical Intelligence’s pi0, while trained on 10x less data. Keep an eye on the release, coming soon! 🧠 We also presented MotoVLA, a new method for training robotics AI systems that drastically reduces the need for large labeled datasets, moving us a step closer to generalist robotic systems. 🏙️ Congratulations to Alexander Marc Spiridonov, Nikolay Nikolov, and Giuliano Albanese, who represented INSAIT at CoRL 2025! 🔮 It’s exciting to see that Bulgaria with INSAIT is now at the forefront of the emerging direction of physical intelligence!

504

Nikola Jovanović

Martin Vechev retweeted

Nikola Jovanović @ni_jovanovic

23 Sep 2025

MathArena Update: Claims about Grok 4 Fast seem to check out, it matches the performance of Grok 4 but is much faster and 20-50x cheaper. Good release! This holds across final-answer competitions, Apex problems, and Project Euler. 🧵

622

96,193