Professor of Computer Science, ETH Zurich. Founder of INSAIT (insait.ai). Works on Safe/Secure AI, LLMs, Quantum. Co-founder of 6 Deep-Tech start-ups.

Joined June 2012
Martin Vechev retweeted
1/ LLMs are increasingly being used to power high-interaction honeypots while maintaining a low security risk. But how good are they really? To answer this question, we introduce Honeyval, the first comprehensive eval framework for LLM-powered honeypots.
Martin Vechev retweeted
LLMs have become capable of proving complex mathematics. However, the proofs they produce vary significantly in how clear, motivated, and insightful they are. To measure these differences, we introduce ProofRank, the first benchmark to scalably evaluate aspects of proof quality.
Martin Vechev retweeted
How often do LLMs claim to prove false mathematical statements? In our latest benchmark, BrokenArXiv, we find they do so very often. The best model, GPT-5.4, only rejects 40% of incorrect statements obtained by perturbing recent ArXiv papers, and other models do much worse.
Martin Vechev retweeted
Are AGENTS actually useful for coding agents? In our latest preprint, we analyze the effect of context files on coding model performance. Short TLDR: It depends. Manually-written files help, while LLM-generated files (e.g. by the agent) don't. More details in the thread 🧵
Martin Vechev retweeted
How far can we push LLM performance on Project Euler (PE), a set of challenging mathematical programming problems? In a new blog, we explore this by designing specialized agents. With the help of the PE community, we also perform a qualitative analysis of model performance. 🧵
Grateful that Google gave us access to Gemini 3 to evaluate its performance on MathArena (math performance) before the release and these results are now included in their release and model card! Gemini 3 is topping all categories on MathArena, great work!
Full @GoogleAI Gemini 3 results are up on MathArena: ➡️ #1 on 2025 Final-Answer Competitions ➡️ #1 on Apex: 5.2% -> 23.4% new SOTA ➡️ #1 on Visual Math: 79% -> 84% new SOTA ➡️ #2 on Project Euler: 62%, huge jump compared to 2.5 Pro (15%)
Martin Vechev retweeted
🎥 Can AI video models truly understand physics? The newly released Physics-IQ benchmark, developed by INSAIT and Google DeepMind and led by Saman Motamed, PhD student at INSAIT, has sparked major discussion across the AI community following its presentation at #ICCV2025. 🔬 The work marks a significant step forward in understanding the physical reasoning limits of today’s generative video models - paving the way for future AI systems that not only generate realistic videos but also reason about the physical world with accuracy and depth. 📊 Physics-IQ provides a comprehensive benchmark of 396 real-world videos, covering diverse physical scenarios - from fluid dynamics to solid mechanics, challenging AI models to predict future frames and interactions beyond surface-level visual cues. 🤔 The findings were eye-opening: even state-of-the-art models like #Sora, #Runway, and #VideoPoet create visually stunning clips but fail to capture true physical dynamics, revealing the gap between perception and understanding. 🚀 The project has been met with great interest from the research community, highlighting the importance of integrating experiential and interactive learning into next-generation video models. 📂 Explore the open-source dataset, evaluation code, and results - links in comments #GenerativeAI #VideoAI #AIResearch #PhysicsInAI #PhysicalReasoning #AIUnderstanding #AIBenchmark #OpenSourceAI #FutureOfAI #AIInnovation #INSAIT
Martin Vechev retweeted
🔥 We’re releasing SPEAR-1 (spear.insait.ai) - a new robotic AI foundation model that achieves state-of-the-art performance with 20× less robotic data 🧠 Why it matters: SPEAR-1 is like the ChatGPT for robots - a single model that can perform many tasks, on any robot, in any environment. 💡 What’s new: unlike others, SPEAR-1 learns from both robotic and non-robotic 3D data, breaking the data bottleneck that slows robotic AI. 🤖 Open-weight, general-purpose, and multilingual for robots - a major step toward scalable robot learning. #Robotics #FoundationModels #3DPerception #Manipulation #INSAIT #Europe #DataEfficiency
Martin Vechev retweeted
🚀 Big news for INSAIT, Bulgaria & Europe! @WIRED (read by 30M people/month) just profiled SPEAR-1 — INSAIT’s new foundation robotic model, the first released by Europe! 🤖 As WIRED notes, SPEAR-1 matches leading global models trained on many times more data - a huge leap toward ChatGPT-like AI for robotics! 👏 Congrats to all involved: Nikolay Nikolov, @giualbanese1, @DSombit, Jan-Nico Zaech, Danda Pani Paudel, Luc Van Gool & Alex Yanev. #AI #Robotics #Europe #INSAIT
Martin Vechev retweeted
🚀 Major news! @Google expands its support for INSAIT with a new $1,000,000 contribution - targeting groundbreaking AI research and expanding INSAIT’s local ecosystem initiatives. 💰 Google’s total support for INSAIT now well exceeds $6 000 000, further strengthening INSAIT’s capacity to conduct world-class research and cultivate the next generation of AI talent in Bulgaria. 🌍 This milestone builds on years of collaboration - from @GoogleDeepMind PhD fellowships to funding for training AI models - helping position INSAIT as a world-class AI research organization. Google also 2x profiled BgGPT, the first Bulgarian LLM built by INSAIT in a series of articles read by millions around the world. ✨ A huge thank you to Google for the continued trust and meaningful support to INSAIT! #INSAIT #Google #AI #BgGPT #Innovation #Research #DeepMind #Gemma #Bulgaria
Martin Vechev retweeted
🚀 INSAIT makes an impact at @ICCVConference! With more than 10,000 participants, ICCV is the world’s leading conference in AI and computer vision - and INSAIT’s booth has become one of its most vibrant spots. Hundreds of researchers, industry leaders, and students stopped by to discover what INSAIT is all about and to connect with our team. 🤖 From cutting-edge robotics and embodied AI to next-generation foundation models, visitors experienced firsthand how INSAIT is pushing the boundaries of global AI research. 🇧🇬 We’re proud to represent Bulgaria on the world stage - proving that world-class deep tech innovation thrives right here in Sofia. #ICCV2025 #INSAIT #AI #ComputerVision #Research #DeepTech #Bulgaria #Robotics
Martin Vechev retweeted
MathArena most viewed on alphaXiv😲 Cool work @askalphaxiv (although Apex is a different dataset, you should title this USAMO and link to our USAMO eval instead)
21 Oct 2025
We used DeepSeek OCR to extract every dataset from tables/charts across 500k AI arXiv papers for $1000 🚀 See which benchmarks are trending and discover datasets you didn't know existed Doing the same task with Mistral OCR would've cost $7500 👀
Martin Vechev retweeted
New competition on MathArena 🥳 This is a nice one, and I can highly recommend checking out some of the traces. Seeing GPT-5 write dozens of pages and still fail on a problem you can solve in <10sec is very satisfying.
MathArena goes visual: We evaluated models such as GPT-5 on Math Kangaroo 2025, a recent contest for ages 6-19 where most tasks require visual reasoning. Models struggle the most with tasks for younger kids. For example, they get this task for 1st graders only 3% of the time 🧵
Martin Vechev retweeted
🚨 Be careful when pruning an LLM! 🚨 Even when the model appears benign, it might start behaving maliciously (e.g., jailbroken) once you download and prune it. Here’s how our attack works 🧵 arxiv.org/abs/2510.07985
Martin Vechev retweeted
🚀⚛️ Major result: we are announcing qblaze – a state-of-the-art quantum simulator, built by researchers at INSAIT and @ETH_en! 🥇 qblaze sets a record for the largest number factored to date with Shor’s algorithm by a quantum circuit simulator – a 39-bit number (549 755 813 701). In comparison, despite recent advances, the largest number factored on an actual quantum computer to date with Shor's algorithm is 21. 📜 qblaze matches the previous record set with the specialized (for Shor’s algorithm) emulator shorgpu – except shorgpu used 2048 GPUs, while qblaze only uses 2 CPUs! ⚡ qblaze outperforms publicly available industry quantum simulators including @IBM’s Qiskit Aer and @Microsoft’s Q#, and on Shor and Grover achieves a speed-up of over 2000x! 🧠 qblaze scales thanks to a novel sparse data structure and highly-optimized parallel algorithms – the research paper describing qblaze’s operation was accepted at ACM OOPSLA’25, and will be presented this week in Singapore. OOPSLA is a top research conference in programming languages and systems. 💻 qblaze is fully open source, documented, has an easy-to-use Python API and can be used as a drop-in replacement for IBM’s Qiskit simulators as well as other quantum frameworks. All about qblaze can be found at qblaze.org. 👏 Congratulations to all qblaze authors: Hristo Venev (INSAIT), Dimitar Dimitrov (INSAIT), Timon Gehr (ETH Zurich), Martin Vechev (INSAIT, ETH Zurich) and Thien Udomsrirungruang (former INSAIT summer research fellow).
Martin Vechev retweeted
🚀 We are excited to announce BrokenMath, the first benchmark designed to systematically evaluate sycophancy in theorem proving with large language models (LLMs), now live at sycophanticmath.ai! 🧩 We show that even the best LLMs can produce convincing but wrong proofs when given false statements by users - a behavior known as sycophancy. This poses a major challenge for deploying AI systems in math and science, where truthfulness and rigor are essential. 📘 BrokenMath introduces 504 expertly verified false theorems derived from 2025 national and international math competition problems, creating a realistic and challenging environment for studying model reliability and reasoning integrity. 📊 The results show that sycophancy is widespread, with even GPT-5 producing proofs for false statements 29% of the time. The issue worsens as problems become more difficult and when tasks involve proof-based reasoning. While mitigation strategies such as prompting, agentic reasoning, and fine-tuning provide partial relief, none fully resolve the issue. 🌐 Explore benchmark, datasets, and paper - links in comments. 👩‍🔬 Congratulations to all authors: Ivo Petrov (INSAIT), Jasper Dekoninck (ETH Zürich), Martin Vechev (INSAIT, ETH Zürich)
Martin Vechev retweeted
🤖 INSAIT had a strong presence at #CoRL2025 – the leading conference in AI for robotics – held this year in Seoul, which gathered more than 2500 participants! 🚀 In exciting news, INSAIT’s robotics team was one of two able to qualify for RoboArena - a challenge for evaluating robotics foundation AI models. INSAIT’s model was able to outperform state-of-the-art models such as Physical Intelligence’s pi0, while trained on 10x less data. Keep an eye on the release, coming soon! 🧠 We also presented MotoVLA, a new method for training robotics AI systems that drastically reduces the need for large labeled datasets, moving us a step closer to generalist robotic systems. 🏙️ Congratulations to Alexander Marc Spiridonov, Nikolay Nikolov, and Giuliano Albanese, who represented INSAIT at CoRL 2025! 🔮 It’s exciting to see that Bulgaria with INSAIT is now at the forefront of the emerging direction of physical intelligence!
Martin Vechev retweeted
MathArena Update: Claims about Grok 4 Fast seem to check out, it matches the performance of Grok 4 but is much faster and 20-50x cheaper. Good release! This holds across final-answer competitions, Apex problems, and Project Euler. 🧵