Joachim Baumann @ ICLR'26

Pinned Tweet

Joachim Baumann @ ICLR'26

@joabaum

Apr 27

We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇

Overview of SWE-chat. Left: a data collection pipeline diagram. Open-source developers install the Entire.io CLI tool, which logs their coding agent sessions and pushes the logs to a dedicated branch on their public GitHub repository. We discover and aggregate these logs into the SWE-chat dataset, with line-level attribution of which lines of code were written by the human versus the agent. Right: a growth chart showing cumulative logged events over time, rising steeply through early 2026. As of April 2026, the dataset contains 2.7 million logged events from over 200 repositories, including 63,000 user prompts and 355,000 agent tool calls across nearly 6,000 sessions.

Joachim Baumann @ ICLR'26 retweeted

Manoel @manoelribeiro

Jun 11

New preprint! We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews. We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well. A thread 🧵 w/ @hayounggjung, @korolova & others

Joachim Baumann @ ICLR'26 retweeted

Christopher Potts

@ChrisGPotts

Jun 8

Does a token buy you more or less now than it did a few months ago? We built a consumer price index (CPI) for AI coding output from Anthropic's Opus 4.6 model in SWE-chat, Feb 5–Apr 15, 2026. What we find looks like tokenflation:

Line chart titled "Token purchasing power across the engineering basket — Opus 4.6 in SWE-chat." Subtitle: each line shows units of the good 1 token buys, relative to a Feb 5–24 baseline (1.00×); knowledge capture (teal) erodes least. The full-basket composite ends at 0.23× purchasing power (95% CI [0.18, 0.28]) = 4.38× more tokens per unit. The y-axis is log-scaled "output per token," from ~0.15× to 1×. The x-axis spans five time windows A–E (Feb 05 to Apr 15), labeled as phases: Pre-mystery climb, Mystery climb, Post-mystery climb. Colored lines track five goods: agent-drafted code, PR shipped, file touched, agent-drafted docs, and knowledge capture; a thick black "COMPOSITE" line with a gray 95% CI band trends downward from 1× to 0.23×. Knowledge capture rebounds to 0.37×. Reasoning effort shifts from "high" to "medium" to "high" across phases.
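
The relation between the chart's two headline figures (0.23× purchasing power and 4.38× more tokens per unit) is a reciprocal. The sketch below is only an illustration of that arithmetic, not the authors' method: the function names and the example token counts are hypothetical. Note that the rounded 0.23× gives about 4.35×, so the chart's 4.38× presumably comes from an unrounded value.

```python
def purchasing_power(baseline_tokens_per_unit: float,
                     current_tokens_per_unit: float) -> float:
    """Units of a good one token buys now, relative to the baseline window (1.00x).

    Hypothetical helper: both arguments are tokens spent per unit of output.
    """
    if baseline_tokens_per_unit <= 0 or current_tokens_per_unit <= 0:
        raise ValueError("token costs must be positive")
    return baseline_tokens_per_unit / current_tokens_per_unit


def token_multiplier(power: float) -> float:
    """How many times more tokens one unit costs at a given purchasing power."""
    if power <= 0:
        raise ValueError("purchasing power must be positive")
    return 1.0 / power


# Made-up numbers: a unit that cost 1,000 tokens at baseline and 4,000 now.
example_power = purchasing_power(1_000, 4_000)   # 0.25x
example_cost = token_multiplier(example_power)   # 4.0x more tokens per unit

# The chart's rounded composite value.
composite_cost = token_multiplier(0.23)          # roughly 4.35x
```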

Joachim Baumann @ ICLR'26 retweeted

Diyi Yang

@Diyi_Yang

Jun 5

We propose a new way to quantify AI overreliance: the Offloading Score 🧐 @vishakh_pk It measures the fraction of cognitive work you hand off to AI 🤖 via simulating how you'd have done each step without AI, then counting the steps the AI saved. It works directly from interaction traces (keystrokes, screenshots), so it's reusable across many tools!!

Vishakh Padmakumar

@vishakh_pk

Jun 3

People are increasingly worried that AI tools make us overreliant. But how do we actually measure this? We introduce Offloading Score, a measure of reliance based on the fraction of cognitive effort offloaded to AI while completing a task. In a controlled user study, Offloading Score detects increased reliance under time pressure, while several common alternatives do not. (1/9)
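
As a toy reading of "simulate each step without AI, then count the steps the AI saved", the score can be sketched as a simple fraction. This is an assumption-laden illustration, not the paper's definition: the function name, the step counts, and the idea of treating every step as equal effort are all hypothetical.

```python
def offloading_score(counterfactual_steps: int, steps_done_by_user: int) -> float:
    """Fraction of the simulated no-AI steps that the user did not do themselves.

    counterfactual_steps: steps the task would have taken without AI (simulated).
    steps_done_by_user: steps the user actually carried out with AI available.
    Hypothetical simplification: every step counts as the same amount of effort.
    """
    if counterfactual_steps <= 0:
        raise ValueError("counterfactual_steps must be positive")
    if not 0 <= steps_done_by_user <= counterfactual_steps:
        raise ValueError("steps_done_by_user must lie in [0, counterfactual_steps]")
    saved = counterfactual_steps - steps_done_by_user
    return saved / counterfactual_steps


# Made-up example: 10 steps without AI, 4 of them still done by the user.
example_score = offloading_score(10, 4)
```

Under these assumptions a score of 0 means nothing was handed off and a score of 1 means every step was.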

Joachim Baumann @ ICLR'26 retweeted

Moritz Sudhof

@mmooritz

Jun 4

Seriously, why did Opus 4.6 suddenly start using more tokens around Feb 25? This is the Great Opus 4.6 Tokenflation Mystery. Calling all token detectives to come help solve it! Let's figure out what's actually driving up these bills, and whether we're happy with the trade.

Christopher Potts

@ChrisGPotts

Jun 4

Replying to @ChrisGPotts

And @mmooritz did a short video that features complaints from his infant son and ends with a call to action – join our detective team!

Joachim Baumann @ ICLR'26 retweeted

Christopher Potts

@ChrisGPotts

Jun 4

We've now done a blog post giving a full analysis of this mysterious tokenflation for Opus 4.6 soon after its launch. The climb is not explained or justified by anything we know or can measure!

Line chart titled "opus-4-6 — output tokens vs other activity metrics, weekly Wed–Tue medians." X-axis spans Feb 04 to Apr 13; y-axis is median per session, indexed to early-Feb baseline of 1.0x, on a log scale from 0.5x to 14x. Annotations mark the "Opus 4.6 launch" (Feb 05), an effort change "high → medium" (Mar 04), and "medium → high" (Apr 07). A shaded band around Feb 25–Mar 04 is labeled "the climb." Most metrics start near 1x and stay flat through mid-Feb, then rise. Output tokens per session (bold red) and per turn (orange) climb sharply to roughly 8–9x by Apr 13, the highest lines. A dashed blue line (output tokens, persistent cohort, 15 users) tracks similarly but ends earlier. Other metrics — tool calls, visible response tokens, cache tokens, API calls, session duration — rise more modestly to about 1.5–3x.

Christopher Potts

@ChrisGPotts

Jun 1

Here's a plot that summarizes this puzzling situation. We can't figure out the cause of the rise in Opus 4.6 token usage around Feb 20. The Mar 4 high-to-medium change only slows the trend too. (The "persistent cohort" helps us feel confident that this is truly an overall trend.)

Line chart titled "opus-4-6 output_tokens by week, with Anthropic's postmortem events overlaid." X-axis: week start (Wed), Feb 1–Apr 15, 2026. Y-axis: median output_tokens per session (log scale), 10³ to ~30,000.
Two lines: red solid ("all opus-4-6, 3,917 sessions") and blue dashed ("persistent cohort, 15 users/943 sess."). Both start ~2,500–3,000 tokens early Feb, dip to a low ~1,000–1,800 around Feb 22, then climb sharply, ending near 25,000–30,000 by mid-April.
Annotations: "Opus 4.6 launches with reasoning effort high" (early Feb); orange "What happened here?" near the dip; vertical lines mark "effort default high→medium" (Mar 4, red), "thinking-clearing bug ships" and "bug fixed" (purple), "effort default medium→high" (green). Shaded band marks medium-effort default (Mar 4–Apr 7).

Joachim Baumann @ ICLR'26 retweeted

Vishakh Padmakumar

@vishakh_pk

Jun 3

Joachim Baumann @ ICLR'26 retweeted

NeurIPS Conference

@NeurIPSConf

Jun 2

This year, the NeurIPS 2026 Position Paper Track made the decision to require that all papers be substantially human-written, with AI used for only copy-editing or similar peripheral changes to the main text! For more details, please check our blogpost: blog.neurips.cc/2026/06/02/a…

Joachim Baumann @ ICLR'26 retweeted

Christopher Potts

@ChrisGPotts

Jun 1

Christopher Potts

@ChrisGPotts

May 31

Does anyone know why Opus 4.6's token consumption in Claude Code would skyrocket specifically in the period Feb 20 to Mar 4, 2026? The model launched Feb 5 with default reasoning "high", but the steep increase for this model doesn't happen until ~2 weeks after launch. Also noteworthy: Mar 4 is when Anthropic changed the default reasoning to "medium", but this seems only to slow the rising consumption rather than reversing it. These observations are based on SWE-chat. We (@mmooritz and I) can control for lots of factors in this dataset (user, project, session length, etc.), and we can't find an explanation for the increase. My own hunch is that there was an unreported change to the reasoning around Feb 20, but I could be missing something.

Joachim Baumann @ ICLR'26 retweeted

Christopher Potts

@ChrisGPotts

May 31

Joachim Baumann @ ICLR'26 retweeted

Steven Dillmann

@StevenDillmann

May 20

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-science-an… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

Joachim Baumann @ ICLR'26 retweeted

Diyi Yang

@Diyi_Yang

May 20

The next frontier of AI is not only a more capable model; it is an AI that *humans* can meaningfully live and work with :) With all students in my cs329x Human-Centered LLM class, we present 60 pages of insights for developing Human-Centered LLMs (HCLLMs), from design & data sourcing to training, eval & deployment 🧵

Joachim Baumann @ ICLR'26 retweeted

Shashwat Goel

@ShashwatGoel7

May 15

Continual learning is bottlenecked by realistic evaluations. Introducing FutureSim, which replays real-world events in the temporal order they occurred. We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex, Claude Code

Joachim Baumann @ ICLR'26 retweeted

Lujain Ibrahim @lujainmibrahim

May 14

New preprint! In 5 studies (3k users / 12k convs, with a 3-wk longitudinal study), we find that sycophantic AI influences how people view those closest to them. It affects how effortful human interaction seems, how satisfying it is, & who people want to turn to for advice 🧵

Joachim Baumann @ ICLR'26 retweeted

Kevin Li

@kevin_x_li

May 13

Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous largest. 112B tokens · 12M trajectories · 122K PRs · 3K repos · 16 languages huggingface.co/datasets/Alie…

AlienKevin/SWE-ZERO-12M-trajectories · Datasets at Hugging Face

Joachim Baumann @ ICLR'26 retweeted

Ricardo Olmedo @rdolmedo_

May 13

Researching coding agents on a tight compute budget? mini-coder-1.7B packs a punch 💪 for its tiny 🤏 size huggingface.co/ricdomolm/min…

ricdomolm/mini-coder-1.7b · Hugging Face

Kevin Li

@kevin_x_li

May 13

It’s incredible how mini-coder was able to reach 50.4 pass@100 on SWE-bench at the 1.7b scale. A perfect fit for synthetic data generation at scale! Thanks @rdolmedo_ for open sourcing the model!

Joachim Baumann @ ICLR'26 retweeted

EMNLP 2026 @emnlpmeeting

May 12

Submitting to ARR for #EMNLP2026? We're running an opt-in AI Reviewing Experiment. Help us test AI-generated reviews during your ARR submission. 🤖 ✅ Reviewers, ACs, and SACs will not be able to see it ✅ Will not affect decisions 🔗 Read more: 2026.emnlp.org/ai-reviewing-…

EMNLP 2026 AI Reviewing Experiment

EMNLP 2026 is running an AI Reviewing Experiment to collect feedback from authors about the quality of AI reviews of their submissions. This experiment is taking place on an opt-in basis, in which...

Joachim Baumann @ ICLR'26

@joabaum

May 12

Thrilled to share the amazing work led by @houjun_liu! 🎉 SecureForge is a much-needed tool to vibe code more securely – and especially cool to see our SWE-chat dataset enabling this kind of research with realistic evals

Houjun Liu @houjun_liu

May 12

🚨 Your coding agent may be secretly sticking vulnerabilities into your code!! 🚨 Wouldn't you want to fix that? Hint: asking it to write secure code is not enough. (1/n)

Joachim Baumann @ ICLR'26 retweeted

Houjun Liu @houjun_liu

May 12

Repo: github.com/sisl/SecureForge/ Package: pypi.org/project/secureforge Paper: arxiv.org/pdf/2605.08382 Cheers to my wonderful collaborators: Lisa Einstein, @jyangballin, @joabaum, @DuncanEddy, @chrmanning, @aiprof_mykel, @Diyi_Yang with the support of @schmidtsciences trustworthy AI.

Joachim Baumann @ ICLR'26 retweeted

Danish Pruthi @danish037

May 11

I believe one of the most important problems is to detect the nature and extent of AI used. Take paper reviewing for example, where many conferences allow reviewers to use LLMs to polish their reviews but not to generate their contents. However, can such polishing-only policies even be enforced? Our recent #ICML paper answers this question in the negative, and shows how even the best AI-text detectors misclassify a non-trivial fraction of LLM-polished reviews as fully AI-generated. This is work led by my amazing students: Rounak Saha (@ahaskanuor), Dayita Chaudhuri (@doyitach) and Naveeja Sajeevan in collaboration with @GurushaJuneja and Nihar Shah. (1/n)🧵
