Naman Jain


Tweets

Pinned Tweet

Naman Jain

@StringChaos

Mar 12

New post: how we do evals at @cursor_ai. Takeaways:
1. Online metrics from real Cursor requests provide construct validity
2. CursorBench: a dynamic offline suite distilled from online learnings
3. Multi-axes evals -- correctness, efficiency, agent interaction behavior

Cursor

@cursor_ai

Mar 12

We're sharing a new method for scoring models on agentic coding tasks. Here's how models in Cursor compare on intelligence and efficiency:


Naman Jain retweeted

elie

@eliebakouch

May 20

correlation between CursorBench and Artificial Analysis reported scores: benchmarks like IFBench or tau2 show ~0 correlation with CursorBench. opus 4.7 (max effort) performs relatively better on CursorBench than on other benchmarks, gpt 5.5 shows the opposite pattern

Michael Truell

@mntruell

May 20

Gemini Flash 3.5 is now on CursorBench, our main coding agent eval. We’ll keep updating the leaderboard as new models come out. cursor.com/evals


Naman Jain retweeted

Michael Truell

@mntruell

May 20

Gemini Flash 3.5 is now on CursorBench, our main coding agent eval. We’ll keep updating the leaderboard as new models come out. cursor.com/evals

Cursor · CursorBench

Compare CursorBench 3.1 results across the models Cursor evaluates.

cursor.com


Naman Jain

@StringChaos

May 18

Check out Composer 2.5, our new model pushing the Pareto frontier

Cursor

@cursor_ai

May 18

Replying to @cursor_ai

Composer 2.5 is exceptionally intelligent and up to 10x more efficient than similarly capable models.


Naman Jain retweeted

Hao Wang

@MogicianTony

Apr 9

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵


Naman Jain retweeted

Cursor

@cursor_ai

Mar 26

Earlier this week, we published our technical report on Composer 2. We're sharing additional research on how we train new checkpoints. With real-time RL, we can ship improved versions of the model every five hours.


Naman Jain retweeted

Sasha Rush

@srush_nlp

Mar 25

It's really neat to see all the interest in the Composer 2 technical report, from training to kernel design to inference. If you have any questions about why we did things, feel free to ask. I'll run around the office and bug people.

Cursor

@cursor_ai

Mar 24

We're releasing a technical report describing how Composer 2 was trained.


Naman Jain

@StringChaos

Mar 25

Check out the tech report detailing our continued pre-training and RL setup behind Composer2! Also sharing some example CursorBench problems by popular demand

Cursor

@cursor_ai

Mar 24

We're releasing a technical report describing how Composer 2 was trained.


Naman Jain

@StringChaos

Mar 25

And this is one of my favorite CursorBench tasks :)


Naman Jain

@StringChaos

Mar 19

Excited to share Composer-2 with everyone. It has come a long way since Composer-1, still lots more to go! Hope you like it!

Cursor

@cursor_ai

Mar 19

Composer 2 is now available in Cursor.


Naman Jain retweeted

Cursor

@cursor_ai

Mar 17

We trained Composer to self-summarize through RL instead of a prompt. This reduces the error from compaction by 50% and allows Composer to succeed on challenging coding tasks requiring hundreds of actions.



Naman Jain

@StringChaos

Mar 12

Lots more details in the post:
1. Pareto frontier across different metrics
2. How CursorBench has shifted as agent capabilities changed
3. CursorBench vs public evals: what’s missing and future work directions
4. CursorBench vs online: how online metrics shape offline evals


Naman Jain

@StringChaos

Mar 12

Check out full post at: cursor.com/blog/cursorbench

How we compare model quality in Cursor · Cursor

We use a hybrid online-offline eval process to keep our understanding of model quality aligned with what developers actually do.

cursor.com


Naman Jain retweeted

Manish Shetty

@slimshetty_

Mar 10

GSO Update. gpt-5.4 (xhigh) scores 31.4% with reasoning_effort=high, gpt-5.4 slightly lower than gpt-5.2. a quick thought on why below:


Naman Jain retweeted

Manish Shetty

@slimshetty_

Feb 18

x.com/i/article/202392186560…


Naman Jain retweeted

Cursor

@cursor_ai

Feb 12

Long-running agents are now available at cursor.com/agents for Ultra, Teams, and Enterprise plans. With our new harness, agents can complete much larger tasks. cursor.com/blog/long-running…


Naman Jain retweeted

Cursor

@cursor_ai

Feb 9

Composer 1.5 is now available. We’ve found it to strike a strong balance between intelligence and speed.


Naman Jain retweeted

Michael Truell

@mntruell

Jan 14

We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week. It's 3M lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM. It *kind of* works! It still has issues and is of course very far from Webkit/Chromium parity, but we were astonished that simple websites render quickly and largely correctly.

Cursor

@cursor_ai

Jan 14

GPT-5.2 Codex is now available in Cursor! We believe it's the frontier model for long-running tasks.


Naman Jain retweeted

Michael Truell

@mntruell

Jan 7

We rebuilt how our agent uses context. Instead of stuffing everything into a prompt, Cursor dynamically discovers context via files, tools, and history, cutting token usage by 46.9% and freeing up more space for the agent to work.

Cursor

@cursor_ai

Jan 6

Cursor's agent now uses dynamic context for all models. It's more intelligent about how context is filled while maintaining the same quality. This reduces total tokens by 46.9% when using multiple MCP servers.


Naman Jain retweeted

Jediah Katz

@jediahkatz

4 Dec 2025

We heard you loud and clear that it was getting confusing to pick between so many models, so we completely revamped the model picker in Cursor.


Cursor

@cursor_ai

4 Dec 2025

The new Codex model is available in Cursor! It's free to use until December 11th. We worked with OpenAI to optimize Cursor's agent harness for the model. cursor.com/blog/codex-model-…
