neekhil vatsa


@garfieldII

All other LLMs: Let us Benchmax our models
Le Chaton: let us get a fat Cat and measure its fat percentage and report it as our model's capability

西村/learningBOX/競プロｱｶ

@ynishi2015

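OpenRouter's Fusion API gives off a strong BenchMax impression, but if it is strong on benchmarks then it is strong, so I'd like to wait until the benchmark results are all in. Will they ever all be in?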

512

The AI Scope @the_ai_scope

MiniMax = benchmax
Rio = benchmax
Nex = benchmax
DeepSeek = real
Kimi = real
GLM = real

am.will

@LLMJunky

10h

Replying to @0xSero

I think all the labs benchmax, especially if they're publicly traded. Doesn't mean it's not also a good model though

165

Nour Eddine Hamaidi

@NOOROU

Jun 13

Replying to @bridgemindai @bridgebench

Benchmax?

133

Loïc Schneider

@modkin_mp

Jun 13

Replying to @modkin_mp @ZixuanLi_

I guess you are dropping a model soon. please dont benchmax

Orthorexic Apopheniac @LittleNonsuch

Jun 12

Replying to @ivanfioravanti

Benchmax or no? I assume you are not running it locally

Zach

@zach_sndr

Jun 12

DeepSWE shows you the true reality amidst all this benchmax bullshit. I'm glad I stuck to my OG gpt 5.5 without spending money on minimax, fable and now the new kimi. Instead of diverse subs and endless routing complications, went all in on Pro 20x plan and its worth every penny. manual resets is just the cherry on top!

316

Tony Feng

@tonylfeng

Jun 12

Replying to @AlexKontorovich

The frontier labs were already bored of math contests by the 2025 Putnam, and they will certainly not bother with IMO 2026. I could imagine that some formalization startup might still try to pursue a Lean-verified IMO perfect score, which has not been accomplished yet ... but I hope they don't -- we all know that Codex/Claude could do this easily, so please just leave the spotlight to the students. As for research math: it appears the only frontier labs pushing this have been OpenAI and Google DeepMind. My guess from observing OpenAI is that they use Erdos problems (only) as some kind of internal benchmark, but they do not try to benchmax and they only release solutions that are sufficiently interesting. DeepMind is a different beast, which pursues math and science applications for their own sake and will likely continue doing so. I think there will also continue to be a space for startups to do research math things. They're not expected to make revenue for a while, and they can find some story to spin to VCs about why it will eventually be profitable.

981

murtaza

@pierizvi

Jun 11

Replying to @karpathy

oh please dont benchmax bro

Jake

@JakeKAllDay

Jun 10

Replying to @AdamHoltererer @ArtificialAnlys @AnthropicAI

yes for now we have cursor bench which gives good data, its just v coding specific. I like the AAII cost token score data because there's nowhere to hide -- those tests are too long to benchmax collectively and we can see performance:cost:latency tradeoff clearly

225

Chao Wang @excel_wang

Jun 10

Replying to @ASM65617010

Is the result based on the public question set that model developers could hack or benchmax, or the private held-out questions?

Andreas Oschinski retweeted

Furkan Gözükara

@FurkanGozukara

Jun 9

Lol they retain every single prompt permanently forever. How do you think they benchmax and improve?

jpark

@jparkjmc

Jun 9

new policy from anthropic: if you use fable/mythos, they collect your data. no exceptions. not even for enterprise partners.

1,764

Dan Roy

@roydanroy

Jun 10

If you were a company and you were going to benchmax once, when would you do it?

4,788

Xu Zou @xz_keg

Jun 10

Replying to @yifanzhang_

The image only supports "there's no wall for benchmax"

322

AdamHumphreys @AdamJHumphreys

Jun 10

Replying to @kimmonismus

Benchmax, annnnd benchmax

121

ph

@wiredaddict

Jun 9

benchmax benchopium

Angel D. Muñoz @angel_d_munoz

Jun 9

Replying to @haider1

They've been nerfing models for a while after release, so the new releases feel "amazing". I noticed this trend last year: they benchmax the release, then dumb it down over the months. That's my reason for not using Anthropic (and Google models). It's just marketing and milking cows.

603