Tweets

Nicholas Edwards retweeted

Marie-Leontine Wörgötter @mlwfee

May 13

I’m excited to present this work today at #LREC2026 here in Mallorca, and I’m looking forward to talking to some of you who are around too! #LLMs #nlproc #pragmatics

1,273

Nicholas Edwards retweeted

Yukyung Lee @yukyunglee_

May 5

Excited to share that RExBench has been accepted to ACL main! 🎉🎉

6,309

Nicholas Edwards @nedwards99

Apr 17

RExBench is now available in Terminal Bench (@harborframework)! 🎉 We integrate 2 tasks (cogs, othello) along with a local testing framework so you can test if your agents can autonomously implement novel AI research extensions.

2,178

Nicholas Edwards @nedwards99

Apr 17

Thanks to @Mike_A_Merrill and @alexgshaw for early discussions, and to @LinShi592021 and the Adapters team for help with integration!

216

Nicholas Edwards @nedwards99

Apr 17

Check out the original RExBench announcement for more details about the benchmark: x.com/yukyunglee_/status/194…

Yukyung Lee @yukyunglee_

2 Jul 2025

Can coding agents autonomously implement AI research extensions? We introduce RExBench, a benchmark that tests if a coding agent can implement a novel experiment based on existing research and code. Finding: Most agents we tested had a low success rate, but there is promise!

300

Nicholas Edwards @nedwards99

Apr 1

🧵 Do coding agents know when to ask for help? Real-world coding tasks are rarely fully specified, yet most agents are optimized to execute autonomously rather than clarify.

1,068

Nicholas Edwards @nedwards99

Apr 1

This was work done with @sebschu. Check out the paper for more:
Paper: arxiv.org/abs/2603.26233
Code: github.com/nedwards99/ask-or…

Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context....

120

Nicholas Edwards @nedwards99

Apr 1

The interactive SWE-bench Verified setting is adapted from Vijayvargiya et al. (2026): arxiv.org/abs/2502.13069

Ambig-SWE: Interactive Agents to Overcome Underspecificity in...

AI agents are increasingly being deployed to automate tasks, often based on underspecified user instructions. Making unwarranted assumptions to compensate for the missing information and failing...

Nicholas Edwards retweeted

Sarah Breckner @hieristSarah

Mar 13

Diffusion LLMs can think EoS-by-EoS! The higher the generation length, the better the performance of Masked Diffusion LLMs, even though they generate the same amount of words and only augment them with more and more EoS tokens 👀

310