Research @cursor_ai | CursorBench, LiveCodeBench, DeepSWE, R2E-Gym, GSO, LMArena Coding | Past: @UCBerkeley @MetaAI @AWS @MSFTResearch @iitbombay

Joined March 2018
Pinned Tweet
New post: how we do evals at @cursor_ai. Takeaways:
1. Online metrics from real Cursor requests provide construct validity
2. CursorBench: a dynamic offline suite distilled from online learnings
3. Multi-axes evals -- correctness, efficiency, agent interaction behavior
We're sharing a new method for scoring models on agentic coding tasks. Here's how models in Cursor compare on intelligence and efficiency:
Naman Jain retweeted
Correlation between CursorBench and Artificial Analysis reported scores: benchmarks like IFBench or tau2 show ~0 correlation with CursorBench. opus 4.7 (max effort) performs relatively better on CursorBench than on other benchmarks; gpt 5.5 shows the opposite pattern.
Gemini Flash 3.5 is now on CursorBench, our main coding agent eval. We’ll keep updating the leaderboard as new models come out. cursor.com/evals
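A rough sketch of how a cross-benchmark correlation like the one mentioned above could be computed. The tweet does not say which statistic was used; Spearman rank correlation is an assumption here, and the function names are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Passing two lists of per-model scores, one per benchmark and in the same model order, gives a value near 0 when the two benchmarks rank models unrelatedly.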
Check out Composer 2.5, our new model pushing the Pareto frontier
Replying to @cursor_ai
Composer 2.5 is exceptionally intelligent and up to 10x more efficient than similarly capable models.
Naman Jain retweeted
SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵
Naman Jain retweeted
Earlier this week, we published our technical report on Composer 2. We're sharing additional research on how we train new checkpoints. With real-time RL, we can ship improved versions of the model every five hours.
Naman Jain retweeted
It's really neat to see all the interest in the Composer 2 technical report, from training to kernel design to inference. If you have any questions about why we did things, feel free to ask. I'll run around the office and bug people.
We're releasing a technical report describing how Composer 2 was trained.
Check out the tech report detailing our continued pre-training and RL setup behind Composer 2! Also sharing some example CursorBench problems by popular demand.
And this is one of my favorite CursorBench tasks :)
Excited to share Composer-2 with everyone. It has come a long way since Composer-1, still lots more to go! Hope you like it!
Composer 2 is now available in Cursor.
Naman Jain retweeted
We trained Composer to self-summarize through RL instead of a prompt. This reduces the error from compaction by 50% and allows Composer to succeed on challenging coding tasks requiring hundreds of actions.
Lots more details in the post:
1. Pareto frontier across different metrics
2. How CursorBench has shifted as agent capabilities changed
3. CursorBench vs public evals: what’s missing and future work directions
4. CursorBench vs online: how online metrics shape offline evals
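For readers unfamiliar with the term, a Pareto frontier over two metrics can be sketched roughly as below. This is only an illustration under assumptions, not the method from the post: the `ModelScore` fields, the model names, and all numbers are hypothetical placeholders.

```python
from typing import NamedTuple


class ModelScore(NamedTuple):
    name: str
    score: float  # hypothetical correctness metric, higher is better
    cost: float   # hypothetical efficiency metric, lower is better


def pareto_frontier(models):
    """Return the models that no other model dominates, sorted by cost."""
    frontier = []
    for m in models:
        # A model is dominated if another is at least as good on both
        # axes and strictly better on at least one of them.
        dominated = any(
            o.score >= m.score
            and o.cost <= m.cost
            and (o.score > m.score or o.cost < m.cost)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.cost)


# Placeholder data, not real benchmark results.
models = [
    ModelScore("model-a", 0.60, 10.0),
    ModelScore("model-b", 0.55, 4.0),
    ModelScore("model-c", 0.50, 6.0),  # dominated by model-b
    ModelScore("model-d", 0.70, 25.0),
]
frontier_names = [m.name for m in pareto_frontier(models)]
```

A model "pushes" the frontier when adding it to the list removes at least one previously undominated model or extends the frontier into a region no model reached before.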
Naman Jain retweeted
GSO Update. gpt-5.4 (xhigh) scores 31.4%. With reasoning_effort=high, gpt-5.4 is slightly lower than gpt-5.2. A quick thought on why below:
Naman Jain retweeted
Long-running agents are now available at cursor.com/agents for Ultra, Teams, and Enterprise plans. With our new harness, agents can complete much larger tasks. cursor.com/blog/long-running…
Naman Jain retweeted
Composer 1.5 is now available. We’ve found it to strike a strong balance between intelligence and speed.
Naman Jain retweeted
We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week. It's 3M lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM. It *kind of* works! It still has issues and is of course very far from Webkit/Chromium parity, but we were astonished that simple websites render quickly and largely correctly.
GPT-5.2 Codex is now available in Cursor! We believe it's the frontier model for long-running tasks.
Naman Jain retweeted
We rebuilt how our agent uses context. Instead of stuffing everything into a prompt, Cursor dynamically discovers context via files, tools, and history, cutting token usage by 46.9% and freeing up more space for the agent to work.
Cursor's agent now uses dynamic context for all models. It's more intelligent about how context is filled while maintaining the same quality. This reduces total tokens by 46.9% when using multiple MCP servers.
Naman Jain retweeted
We heard you loud and clear that it was getting confusing to pick between so many models, so we completely revamped the model picker in Cursor.
4 Dec 2025
The new Codex model is available in Cursor! It's free to use until December 11th. We worked with OpenAI to optimize Cursor's agent harness for the model. cursor.com/blog/codex-model-…