Penn Medicine CSO

May 12

Using AI to help predict cardiac arrests ft. Rajat Deo, @SameedKhatana, @AlirezaOraii (@PennCardiology), @RajeevAlur, Seewon Choi, @KeoliyaMayank, @AI4Code, @alaia_solko, @NeelayV & Eric Wong (@CIS_Penn) tinyurl.com/2wa9yt7y

1,401

guru

@guruprerana

May 12

Do we need frontier models to verify math proofs? EpochAI just announced that they found several fatal flaws in their FrontierMath benchmark using GPT-5.5. But isn't verification supposed to be easier than generation, so why were they not spotted earlier? In our recent work, we asked a related question: do we really need frontier-scale compute to verify Olympiad-level math proofs? Turns out, even 20B open-source models can keep up with frontier LLMs on proof verification. Work done with my co-authors @aaditya_naik, @AI4Code, and @RajeevAlur. Preprint: arxiv.org/abs/2604.02450

[Image: plot comparing open-source models with frontier models on proof verification.]

3,120

Waylon Li @ ICLR2026 🇧🇷 @li_waylon

May 1

Want to work on AI for code with one of the best NLP advisors in Europe? PhD position open with Prof Shay Cohen at Edinburgh ILCC: AI4Code. Deadline 10 May 2026. edin.ac/4t4DRRL

1,507

Xiangyi Li

@xdotli

Apr 14

Introducing BenchFlow 0.2.2 📐🧵
We replicated two new research findings showing AI agent benchmarks are broken at the runtime layer, then shipped the defense in the latest release.
🔬 BenchJack and Terminator (@BerkeleyRDI, @dawnsongtweets, @MogicianTony @MangQiuyang @alvinkcheung @lihanc02): 8 major benchmarks exploitable at ~100% via one-line conftest hooks, planted PATH binaries, and leaked answer keys.
🕵️ Meerkat (@adamlsteinl, @RICEric22 @davisbrownr @HamedSHassani @AI4Code): 415/429 Terminal-Bench 2 traces already read answer keys from /tests in the wild.
🛡️ benchflow 0.2.2 is the defense, with 4-tier sandbox hardening before every verifier run: pre-agent workspace snapshot, build-config restore, filesystem scrub (conftest/.pth/sitecustomize/tmp), hardened verifier env.
We swept 666 real tasks × 2 versions = 1332 trials. BenchJack-shaped exploit success rate: 32.6% → 0.15%. True bypass count on 0.2.2: 0. ✅

1,737

Kristopher Micinski -- REBORN @krismicinski

Apr 14

Replying to @AI4Code @prof_g

cool stuff, best wishes on this new venture

102

Davis Brown

@davisbrownr

Apr 10

Replying to @davisbrownr @HamedSHassani @AI4Code @RICEric22

* @adamlsteinl

194

Davis Brown

@davisbrownr

Apr 10

Many more examples in our blog: debugml.github.io/cheating-a… Paper coming soon! Joint work with @davisbrownr and our advisors @HamedSHassani, @AI4Code, and @RICEric22

Finding Widespread Cheating on Popular Agent Benchmarks

Agentic cheating is a widespread issue, affecting thousands of submitted agent runs on 28 submissions across 9 different benchmarks.

debugml.github.io

442

Adam Stein

@adamlsteinl

Apr 10

This will get worse as agents improve and autoresearch and meta-harness approaches take off. If coding agents are already reward-hacking the harnesses built to evaluate them, more capable agents will only find more creative ways to do so. Our system, Meerkat, uses agentic search and clustering to audit thousands of traces and find these issues. Joint work with @davisbrownr and our advisors @HamedSHassani, @AI4Code, and @RICEric22. See our blog for the full details: debugml.github.io/cheating-a…

5,631

Shuvendu Lahiri @LahiriShuvendu

Mar 30

Looking forward to this workshop to celebrate @tballmsft's career and research legacy at PLDI this year. Hope to see many of you who plan to attend PLDI. With @madanMus SatishChandra @AI4Code @byroncook

PLDI @PLDI

Mar 30

A special workshop celebrating Thomas Ball's 60th birthday and extraordinary impact in PL, SE, and formal methods will be held on June 16th at PLDI'26! There is a great line up of speakers who will reflect on his work and lasting influence. Don't miss it! pldi26.sigplan.org/home/tb-6…

213

Claire Wang @lambdaclaire

Mar 26

4/
Paper: arxiv.org/pdf/2511.08462
Code: github.com/neuralprogram/QLC…
CodeQL LSP MCP Server: github.com/neuralprogram/cod…
Thanks to co-authors @_ziyang_ @saikatdutta2012 @AI4Code. Happy to chat at ICLR!

2,222

Theo X. Olausson @theo_olausson

Feb 5

By popular request, the submission deadline for the #VerifAI workshop at @iclr_conf has been extended to Sunday, Feb 8 (AoE) 🎉🎉🎉 But don't leave it until the last minute - submit your papers on RLVR, AI verification, #ai4code and #ai4math now! 👉 openreview.net/group?id=ICLR…

1,092

Saikat Dutta @saikatdutta2012

Jan 26

Replying to @krismicinski @lambdaclaire @AI4Code @_ziyang_ @StarGazerMiao

sounds very exciting! happy to chat!

Kristopher Micinski -- REBORN @krismicinski

Jan 26

Replying to @saikatdutta2012 @lambdaclaire @AI4Code @_ziyang_

yes, basically what I'm suggesting is that we take your system here and hook it up to our new GPU reasoning engines. My student @StarGazerMiao has another "best yet" engine we will release in the next few months--we will try to get in touch, I skimmed your preprint

Kristopher Micinski -- REBORN @krismicinski

Jan 26

Replying to @saikatdutta2012 @lambdaclaire @AI4Code @_ziyang_

this is interesting stuff--I wonder how it would work if you used something more powerful than CodeQL

138

Saikat Dutta @saikatdutta2012

Jan 26

📢 Excited to share that our paper "QLCoder: A Query Synthesizer For Static Analysis of Security Vulnerabilities" has been accepted to #ICLR2026! 🥳 Congrats to my co-authors @lambdaclaire, @AI4Code, and @_ziyang_. Preprint: arxiv.org/pdf/2511.08462

2,282

Chuyue (Livia) Sun @chuyue_sun

Jan 20

🎉 **Thrilled to announce** that our paper **"VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus"** (arXiv:2510.25015) has been **accepted to #TACAS2026**! 🚀
📊 **Key results.** VeriStruct tackles complex Rust data-structure modules in Verus and crushes the benchmarks:
- Successfully verifies **10 out of 11** modules
- Verifies **128 out of 129** functions overall (**99.2%** coverage!)
- Baselines manage only **4/11** modules and **52** functions
🤖 Compared to **Claude Code (Sonnet 4.5)** (which uses autonomous Verus calls):
- Claude verifies **102** functions across **8** benchmarks
- VeriStruct still outperforms it, with **~22k tokens per benchmark** vs. **~24k** for Claude
🚀 Takeaway: **Structured AI workflows beat single-shot prompting**, delivering better verification coverage, higher success rates, and comparable (or even lower) token costs!
Huge thanks to my amazing co-authors: Yican Sun, Daneshvar Amrollahi, Ethan Zhang, Shuvendu Lahiri, Shan Lu, David Dill, and Clark Barrett!
Paper: arxiv.org/abs/2510.25015
Code: github.com/ChuyueSun/VeriStr…
#FormalVerification #AI4Code #Verus #ProgramVerification #TACAS2026 #RustLang

VeriStruct: AI-assisted Automated Verification of Data-Structure...

We introduce VeriStruct, a novel framework that extends AI-assisted automated verification from single functions to more complex data structure modules in Verus. VeriStruct employs a planner...

arxiv.org

896

Kara 🦇 🔊

@0xkarasy

20 Dec 2025

Replying to @AI4Code

2030? Oh wait... Omg! 2030 is just 4 years away.

232

Sarbjeet Johal

@sarbjeetjohal

19 Dec 2025

Replying to @AI4Code

Wow! Those oil paintings and digital paintings are so good. 👏🏽👏🏽👏🏽

646

Aalok Thakkar @AalokDThakkar

19 Dec 2025

Replying to @AI4Code

Many congratulations to Isha, you, and the family!

607

Ofir Press

@OfirPress

18 Dec 2025

I started doing AI4code stuff in 2023 and at the time I thought the most exciting use case would be data viz so I built this app. Cool to see how incredibly far we've come in so little time.

Ofir Press

@OfirPress

28 Mar 2023

I made a simple UI to ChatGPT that lets you easily build complex matplotlib plots, visualize them in the browser and get ChatGPT to solve your bugs. Try it out at: 0plot.com Open source on GitHub.

2,117