CrazyCao retweeted
We benchmarked 7 frontier models on 3 categories of autoresearch tasks: ML engineering, harness/prompt engineering, and algorithmic discovery. Fable-5 won overall even under cost constraint, but on ML engineering, the open model Kimi-K2.7-Code surpassed frontier models. 🧵 (1/5)
Replying to @7384254b @_imsigh
waitwaitwait could I use karpathy autoresearch to make a .. ooh.. ouhhh
the collaboration aspect is fun, maybe the first truly collaborative autoresearch comp?
Replying to @zhengyaojiang
Did you do an analysis of the similarity of answers or scores across problems? I'm especially curious given the recent results from @OpenRouter Fusion API: maybe a bag of models would perform better, especially in a verifiable domain like autoresearch.

Introducing the Fusion API, the smartest compound model in the market. Fusion achieves Fable-level intelligence at half the price. How it works 👇