I’m excited to present this work today at #LREC2026 here in Mallorca, and I’m looking forward to talking with those of you who are here too!
#LLMs #nlproc #pragmatics
RExBench is now available in Terminal Bench (@harborframework)! 🎉
We integrate two tasks (cogs, othello) along with a local testing framework, so you can test whether your agents can autonomously implement novel AI research extensions.
Can coding agents autonomously implement AI research extensions?
We introduce RExBench, a benchmark that tests if a coding agent can implement a novel experiment based on existing research and code.
Finding: Most agents we tested had a low success rate, but there is promise!
🧵 Do coding agents know when to ask for help?
Real-world coding tasks are rarely fully specified, yet most agents are optimized to execute autonomously rather than clarify.
Diffusion LLMs can think EoS-by-EoS!
The longer the generation length, the better Masked Diffusion LLMs perform, even though they generate the same number of words and only pad them with more and more EoS tokens 👀