Today I finally get to share something our team has been quietly grinding on for months: we've created an open-sourced version of Cursor Bench @cursor_ai.
If you've been following Cursor's Composer launch and their internal "Cursor Bench" for testing vibe coding models, you can think of our LCBA bench as the open-source, model-agnostic counterpart.
Here is what we provide at @SFResearch. With LCBA bench we:
• Ship a Cursor-style agent stack: ReAct loop, semantic @ codebase search, grep, file read/write, refactor tools, and a three-tier memory system inspired by production coding assistants like Cursor.
• Take 8,000 real-world vibe coding scenarios and turn them into interactive agent gyms across 10 languages and 10K–1M token codebases.
• Let you plug in any model (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, etc.) and see how it actually behaves on long, messy, multi-turn coding tasks.
A few fun findings: Cursor-style agents with context management are surprisingly robust at 1M-token contexts, but there's a hard trade-off between deep exploration and efficiency: no single frontier model sits in the "perfect" top-right corner yet. Anthropic Claude 4.5 and Google Gemini 2.5 Pro are on the Pareto frontier.
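If "on the Pareto frontier" is unfamiliar, here is a minimal sketch of the idea. Everything in it is an assumption for illustration: the `Result` type, the two axis names (`exploration`, `efficiency`), the placeholder model names, and the made-up scores are not taken from the report or the released code. A model is on the frontier if no other model is at least as good on both axes and strictly better on one.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    exploration: float  # hypothetical score, higher is better
    efficiency: float   # hypothetical score, higher is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    at_least_as_good = a.exploration >= b.exploration and a.efficiency >= b.efficiency
    strictly_better = a.exploration > b.exploration or a.efficiency > b.efficiency
    return at_least_as_good and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep every result that no other result dominates (input order preserved)."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Made-up placeholder numbers, NOT results from the tech report.
results = [
    Result("model-a", exploration=0.80, efficiency=0.40),
    Result("model-b", exploration=0.60, efficiency=0.70),
    Result("model-c", exploration=0.55, efficiency=0.50),
]

frontier = pareto_frontier(results)
print([r.model for r in frontier])  # ['model-a', 'model-b']
```

In this toy data, model-a and model-b each win on one axis, so both are on the frontier, while model-c is beaten by model-b on both axes and drops off. A model in the "perfect" top-right corner would dominate all the others and be the only one left.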
Everything is open source (agent, code, scenarios, traces, metrics) on
@huggingface:
Tech Report:
arxiv.org/pdf/2509.09614
GitHub:
github.com/SalesforceAIResea…
Dataset:
huggingface.co/datasets/jaso…
If you're building coding agents, benchmarking your model against GPT/Claude/Gemini, or want to train your coding agents with RL in real coding environments, we'd love for you to try LCBA bench, and tell us your findings!