Using AI to help predict cardiac arrests ft. Rajat Deo, @SameedKhatana, @AlirezaOraii (@PennCardiology), @RajeevAlur, Seewon Choi, @KeoliyaMayank, @AI4Code, @alaia_solko, @NeelayV & Eric Wong (@CIS_Penn) tinyurl.com/2wa9yt7y
3
7
1,401
Do we need frontier models to verify math proofs? EpochAI just announced that they found several fatal flaws in their FrontierMath benchmark using GPT-5.5. But isn't verification supposed to be easier than generation, so why were they not spotted earlier? In our recent work, we asked a related question: do we really need frontier-scale compute to verify Olympiad-level math proofs? Turns out, even 20B open-source models can keep up with frontier LLMs on proof verification. Work done with my co-authors @aaditya_naik, @AI4Code, and @RajeevAlur Preprint: arxiv.org/abs/2604.02450
2
4
18
3,120
Want to work on AI for code with one of the best NLP advisors in Europe? PhD position open with Prof Shay Cohen at Edinburgh ILCC: AI4Code. Deadline 10 May 2026. edin.ac/4t4DRRL

5
15
1,507
Introducing BenchFlow 0.2.2 🧵 We replicated two new research findings showing AI agent benchmarks are broken at the runtime layer, then shipped the defense in the latest release. 🔬 BenchJack and Terminator (@BerkeleyRDI, @dawnsongtweets, @MogicianTony @MangQiuyang @alvinkcheung @lihanc02): 8 major benchmarks exploitable at ~100% via one-line conftest hooks, planted PATH binaries, and leaked answer keys. 🕵️ Meerkat (@adamlsteinl, @RICEric22 @davisbrownr @HamedSHassani @AI4Code): 415/429 Terminal-Bench 2 traces already read answer keys from /tests in the wild. 🛡️ benchflow 0.2.2 is the defense: 4-tier sandbox hardening before every verifier run: pre-agent workspace snapshot, build-config restore, filesystem scrub (conftest/.pth/sitecustomize/tmp), hardened verifier env. We swept 666 real tasks × 2 versions = 1332 trials. BenchJack-shaped exploit success rate: 32.6% → 0.15%. True bypass count on 0.2.2: 0. ✅
6
6
21
1,737
Replying to @AI4Code @prof_g
cool stuff, best wishes on this new venture
3
102
This will get worse as agents improve and autoresearch and meta-harness approaches take off. If coding agents are already reward-hacking the harnesses built to evaluate them, more capable agents will only find more creative ways to do so. Our system, Meerkat, uses agentic search and clustering to audit thousands of traces and find these issues. Joint work with @davisbrownr and our advisors @HamedSHassani, @AI4Code, and @RICEric22. See our blog for the full details: debugml.github.io/cheating-a…
4
3
62
5,631
Looking forward to this workshop to celebrate @tballmsft's career and research legacy at PLDI this year. Hope to see many of you who plan to attend PLDI. With @madanMus SatishChandra @AI4Code @byroncook
Mar 30
A special workshop celebrating Thomas Ball's 60th birthday and extraordinary impact in PL, SE, and formal methods will be held on June 16th at PLDI'26! There is a great lineup of speakers who will reflect on his work and lasting influence. Don't miss it! pldi26.sigplan.org/home/tb-6…
16
213
4/ Paper arxiv.org/pdf/2511.08462 Code github.com/neuralprogram/QLC… CodeQL LSP MCP Server github.com/neuralprogram/cod… Thanks to co-authors @_ziyang_ @saikatdutta2012 @AI4Code Happy to chat at ICLR!

2
5
2,222
By popular request, the submission deadline for the #VerifAI workshop at @iclr_conf has been extended to Sunday, Feb 8 (AoE) 🎉🎉🎉 But don't leave it until the last minute - submit your papers on RLVR, AI verification, #ai4code and #ai4math now! 👉 openreview.net/group?id=ICLR…
1
5
1,092
sounds very exciting! happy to chat!
2
52
yes, basically what I'm suggesting is that we take your system here and hook it up to our new GPU reasoning engines. My student @StarGazerMiao has another "best yet" engine we will release in the next few months--we will try to get in touch, I skimmed your preprint
1
2
52
this is interesting stuff--I wonder how it would work if you used something more powerful than CodeQL
1
2
138
📢 Excited to share that our paper "QLCoder: A Query Synthesizer For Static Analysis of Security Vulnerabilities" has been accepted to #ICLR2026! 🥳 Congrats to my co-authors @lambdaclaire, @AI4Code, and @_ziyang_ Preprint: arxiv.org/pdf/2511.08462

1
5
28
2,282
🎉 **Thrilled to announce** that our paper **"VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus"** (arXiv:2510.25015) has been **accepted to #TACAS2026**! 🚀 📊 **Key results**: VeriStruct tackles complex Rust data-structure modules in Verus and crushes the benchmarks: - Successfully verifies **10 out of 11** modules - Verifies **128 out of 129** functions overall (**99.2%** coverage!) - Baselines manage only **4/11** modules and **52** functions 🤖 Compared to **Claude Code (Sonnet 4.5)** (which uses autonomous Verus calls): - Claude verifies **102** functions across **8** benchmarks - VeriStruct still outperforms it, with **~22k tokens per benchmark** vs. **~24k** for Claude 🚀 Takeaway: **Structured AI workflows beat single-shot prompting**, delivering better verification coverage, higher success rates, and comparable (or even lower) token costs! Huge thanks to my amazing co-authors: Yican Sun, Daneshvar Amrollahi, Ethan Zhang, Shuvendu Lahiri, Shan Lu, David Dill, and Clark Barrett! Paper: arxiv.org/abs/2510.25015 Code: github.com/ChuyueSun/VeriStr… #FormalVerification #AI4Code #Verus #ProgramVerification #TACAS2026 #RustLang
1
1
9
896
Replying to @AI4Code
2030? Oh wait... Omg! 2030 is just 4 years away.
3
232
Replying to @AI4Code
Wow! Those oil paintings and digital paintings are so good. 👏🏽👏🏽👏🏽
1
4
646
Replying to @AI4Code
Many congratulations to Isha, you, and the family!
2
607
18 Dec 2025
I started doing AI4code stuff in 2023 and at the time I thought the most exciting use case would be data viz so I built this app. Cool to see how incredibly far we've come in so little time.
28 Mar 2023
I made a simple UI to ChatGPT that lets you easily build complex matplotlib plots, visualize them in the browser and get ChatGPT to solve your bugs. Try it out at: 0plot.com Open source on GitHub.
1
1
14
2,117