Postdoc @StanfordNLP @StanfordAILab / Prev: @MilaNLProc @UZH_en @MPI_IS @CarnegieMellon. CompSocSci, LLMs, algorithmic fairness.

Joined February 2021
Pinned Tweet
We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇
14
78
478
70,112
Joachim Baumann @ ICLR'26 retweeted
New preprint! We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews. We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well. A thread 🧵 w/ @hayounggjung, @korolova & others
10
51
191
36,210
Joachim Baumann @ ICLR'26 retweeted
Does a token buy you more or less now than it did a few months ago? We built a consumer price index (CPI) for AI coding output from Anthropic's Opus 4.6 model in SWE-chat, Feb 5–Apr 15, 2026. What we find looks like tokenflation:
23
47
367
49,304
Joachim Baumann @ ICLR'26 retweeted
We propose a new way to quantify AI overreliance: the Offloading Score 🧐 @vishakh_pk It measures the fraction of cognitive work you hand off to AI 🤖 via simulating how you'd have done each step without AI, then counting the steps the AI saved. It works directly from interaction traces (keystrokes, screenshots), so it's reusable across many tools!!
People are increasingly worried that AI tools make us overreliant. But how do we actually measure this? We introduce Offloading Score, a measure of reliance based on the fraction of cognitive effort offloaded to AI while completing a task. In a controlled user study, Offloading Score detects increased reliance under time pressure, while several common alternatives do not. (1/9)
3
23
169
46,021
Joachim Baumann @ ICLR'26 retweeted
Seriously, why did Opus 4.6 suddenly start using more tokens around Feb 25? This is the Great Opus 4.6 Tokenflation Mystery. Calling all token detectives to come help solve it! Let's figure out what's actually driving up these bills, and whether we're happy with the trade.
Replying to @ChrisGPotts
And @mmooritz did a short video that features complaints from his infant son and ends with a call to action – join our detective team!
2
4
1,022
Joachim Baumann @ ICLR'26 retweeted
We've now done a blog post giving a full analysis of this mysterious tokenflation for Opus 4.6 soon after its launch. The climb is not explained or justified by anything we know or can measure!
Here's a plot that summarizes this puzzling situation. We can't figure out the cause of the rise in Opus 4.6 token usage around Feb 20. The Mar 4 high-to-medium change only slows the trend too. (The "persistent cohort" helps us feel confident that this is truly an overall trend.)
1
3
44
5,354
Joachim Baumann @ ICLR'26 retweeted
People are increasingly worried that AI tools make us overreliant. But how do we actually measure this? We introduce Offloading Score, a measure of reliance based on the fraction of cognitive effort offloaded to AI while completing a task. In a controlled user study, Offloading Score detects increased reliance under time pressure, while several common alternatives do not. (1/9)
7
74
208
75,388
Joachim Baumann @ ICLR'26 retweeted
This year, the NeurIPS 2026 Position Paper Track made the decision to require that all papers be substantially human-written, with AI used for only copy-editing or similar peripheral changes to the main text! For more details, please check our blogpost: blog.neurips.cc/2026/06/02/a…

17
60
408
143,912
Joachim Baumann @ ICLR'26 retweeted
Here's a plot that summarizes this puzzling situation. We can't figure out the cause of the rise in Opus 4.6 token usage around Feb 20. The Mar 4 high-to-medium change only slows the trend too. (The "persistent cohort" helps us feel confident that this is truly an overall trend.)
Does anyone know why Opus 4.6's token consumption in Claude Code would skyrocket specifically in the period Feb 20 to Mar 4, 2026? The model launched Feb 5 with default reasoning "high", but the steep increase for this model doesn't happen until ~2 weeks after launch. Also noteworthy: Mar 4 is when Anthropic changed the default reasoning to "medium", but this seems only to slow the rising consumption rather than reversing it. These observations are based on SWE-chat. We (@mmooritz and I) can control for lots of factors in this dataset (user, project, session length, etc.), and we can't find an explanation for the increase. My own hunch is that there was an unreported change to the reasoning around Feb 20, but I could be missing something.
4
4
23
11,870
Joachim Baumann @ ICLR'26 retweeted
Does anyone know why Opus 4.6's token consumption in Claude Code would skyrocket specifically in the period Feb 20 to Mar 4, 2026? The model launched Feb 5 with default reasoning "high", but the steep increase for this model doesn't happen until ~2 weeks after launch. Also noteworthy: Mar 4 is when Anthropic changed the default reasoning to "medium", but this seems only to slow the rising consumption rather than reversing it. These observations are based on SWE-chat. We (@mmooritz and I) can control for lots of factors in this dataset (user, project, session length, etc.), and we can't find an explanation for the increase. My own hunch is that there was an unreported change to the reasoning around Feb 20, but I could be missing something.
2
2
19
9,693
Joachim Baumann @ ICLR'26 retweeted
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-science-an… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵
16
111
495
905,419
Joachim Baumann @ ICLR'26 retweeted
The next frontier of AI is not only a more capable model; it is an AI that *humans* can meaningfully live and work with :) With all students in my cs329x Human-Centered LLM class, we present 60 pages of insights for developing Human-Centered LLMs (HCLLMs), from design & data sourcing to training, eval & deployment 🧵
14
78
287
53,981
Joachim Baumann @ ICLR'26 retweeted
Continual learning is bottlenecked by realistic evaluations. Introducing FutureSim, which replays real-world events in the temporal order they occurred. We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex and Claude Code.
21
65
531
112,916
Joachim Baumann @ ICLR'26 retweeted
New preprint! In 5 studies (3k users / 12k convs, with a 3-wk longitudinal study), we find that sycophantic AI influences how people view those closest to them. It affects how effortful human interaction seems, how satisfying it is, & who people want to turn to for advice 🧵
6
54
174
59,033
Joachim Baumann @ ICLR'26 retweeted
Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous largest. 112B tokens · 12M trajectories · 122K PRs · 3K repos · 16 languages huggingface.co/datasets/Alie…
19
67
526
79,554
Joachim Baumann @ ICLR'26 retweeted
Researching coding agents on a tight compute budget? mini-coder-1.7B packs a punch 💪 for its tiny 🤏 size huggingface.co/ricdomolm/min…
It’s incredible how mini-coder was able to reach 50.4 pass@100 on SWE-bench at the 1.7B scale. A perfect fit for synthetic data generation at scale! Thanks @rdolmedo_ for open sourcing the model!
3
16
1,883
Joachim Baumann @ ICLR'26 retweeted
Submitting to ARR for #EMNLP2026? We're running an opt-in AI Reviewing Experiment. Help us test AI-generated reviews during your ARR submission. 🤖 ✅ Reviewers, ACs, and SACs will not be able to see it ✅ Will not affect decisions 🔗 Read more: 2026.emnlp.org/ai-reviewing-…
2
20
95
11,745
Thrilled to share the amazing work led by @houjun_liu! 🎉 SecureForge is a much-needed tool to vibe code more securely – and especially cool to see our SWE-chat dataset enabling this kind of research with realistic evals
🚨 Your coding agent may be secretly sticking vulnerabilities into your code!! 🚨 Wouldn't you want to fix that? Hint: asking it to write secure code is not enough. (1/n)
2
8
23
9,814
Joachim Baumann @ ICLR'26 retweeted
I believe one of the most important problems is to detect the nature and extent of AI use. Take paper reviewing, for example, where many conferences allow reviewers to use LLMs to polish their reviews but not to generate their contents. However, can such polishing-only policies even be enforced? Our recent #ICML paper answers this question in the negative, and shows how even the best AI-text detectors misclassify a non-trivial fraction of LLM-polished reviews as fully AI-generated. This is work led by my amazing students: Rounak Saha (@ahaskanuor), Dayita Chaudhuri (@doyitach) and Naveeja Sajeevan in collaboration with @GurushaJuneja and Nihar Shah. (1/n)🧵
1
9
41
3,259