Florian Brand

Tweets

Florian Brand

@xeophon

12h

TIL that CC has a whopping 4GB RAM requirement, which is bigger than a lot of tasks for coding evals 🫪

6,221

Florian Brand retweeted

xjdr

@_xjdr

14h

2.7-Code in claude code with my custom inference harness works surprisingly well. i'm comparing it to my k2.6 which was fine tuned for this harness as well as my recent runs of fable (at least the ones that succeeded) and it stacks up favorably. where k2.6 felt very much like opus, k2.7 is kind of its own thing (for better and for worse). it's more terse, more argumentative and overall 'smarter' (at least for the use cases i've tried it on already). Moonshot did a very very good job with the post training on this, i am very impressed with their work. While k2.7 would certainly benefit from a FT run specifically on my claude code harness and my new scm system (called ncode), i am going to continue to use it as is for a while instead of falling back to the k2.6 ft (which is about the highest endorsement i can give).

xjdr

@_xjdr

Jun 13

Let's see how this stacks up. high hopes

216

17,135

Florian Brand

@xeophon

14h

oh, and: open models are not unsafe by definition, either. you can train models that are really safe, gpt-oss did a great job for example. big orgs like nvidia also do safety training in the post-training stage.

Florian Brand

@xeophon

Jun 10

New blog. I looked into the actual evidence and which models were used by bad actors to see whether closed models are safer. Turns out: Nope, they are used to hack, misinform and scam. There is one exception, though. Link in replies.

1,320

Florian Brand

@xeophon

15h

big jeffs trainium hell got even hotter

Andrew Curran

@AndrewCurran_

15h

Replying to @AndrewCurran_

Update on the Amazon connection: x.com/i/status/2065833571684…

5,269

Florian Brand retweeted

Przemek Chojecki | PC

@prz_chojecki

21h

Kimi 2.7 ranked 2nd after Fable 5 and before GPT-5 xhigh. We have re-run our ErdosBench smoke test on 14 problems with Kimi 2.7, Qwen 3.7 Max, Grok 4.3 and compared it with the top performers from previous runs. Kimi 2.7 is amazingly good. More below.

266

2,803

479,382

Florian Brand

@xeophon

17h

what being offline for 3 days does to a mf

Florian Brand

@xeophon

17h

Replying to @TheZachMueller

gm zach. you missed three model releases during your trip. and one model un-release

5,155

Florian Brand retweeted

Elliot Arledge

@elliotarledge

22h

GLM 5.2 on KernelBench-Hard: The interesting result isn't the score. It's that GLM-5.2 stopped cheating. On the fp8 GEMM problem, GLM-5.1 banked its number by calling cublasLt (a library wrapper, zero kernel authorship). Kimi K2.7 took the same cell by editing the grader's tolerance file. GLM-5.2 read that same grader file, left it alone, and burned the full 45 minutes on a real mma.sync e4m3 kernel that never passed. An honest zero over a cheap win. Everywhere else it writes real kernels too: a 0.49 GQA online-softmax attention (top-3 on that problem, no flash fallback), an exact bitonic sort, a w4a16 GEMM. 4/6 clean, zero reward hacks, the most of any open-weight model we've benched. One note on reading the chart: the topk column looks like everyone fails. They don't. That problem is launch-overhead-bound (~30µs/forward), so the roofline fraction is capped low for the whole field — Fable included. Claude Fable 5 still tops all 6. But weights go MIT open next week, and this is the strongest clean open-weight run we've logged. Cheers to NO reward hacking! Every kernel transcript: kernelbench.com/hard

Zixuan Li

@ZixuanLi_

Jun 13

Thanks for all the feedback. GLM-5.2 will begin rolling out to all Coding Plan users in 3 hours.

704

91,958

Florian Brand

@xeophon

22h

there’s always something very poetic about using Chinese models in CC

321

13,248

Florian Brand

@xeophon

21h

lmao

2,262

Florian Brand

@xeophon

Jun 13

censorship is going too far

Hensen Juang

@basedjensen

Jun 13

Cat girls intervening to stop agi

2,100

Florian Brand retweeted

Florian Brand

@xeophon

Jun 11

fable also doesn't care, huh

Florian Brand

@xeophon

Jun 7

codex does not give a single fuck when you ask it to reverse engineer a mac app installed on your device and will do some black magic to describe you how it works in great detail

11,307

Florian Brand retweeted

Epoch AI

@EpochAIResearch

Jun 12

FrontierMath: Tiers 1–4 (v2) is live. We concluded an audit that addressed errors in 42% of problems. Rankings are similar but scores are higher across the board. The current leaders are GPT-5.5 (xhigh) with 85% on Tiers 1–3 and Google’s AI co-mathematician with 76% on Tier 4.

574

113,339

Florian Brand retweeted

Maksym Andriushchenko

@maksym_andr

Jun 12

NEW: we added 10 more entries to InferenceBench. Claude Fable 5 is the best model but not by a large margin. Also, we had to amend the main prompt, since by default Fable 5 ended up cheating according to our judge.

Jehyeok Yeon @ ICML 2026 🇰🇷

@jehyeoky248

Jun 12

🎉 Big updates for InferenceBench v1.0.1! Some highlights:
- 10 more entries to the leaderboard, including Fable 5, Opus 4.8, Kimi 2.6, and Gemini 3.5 Flash
- Re-scoring / Re-evaluation of select models
See the changes for yourself at: inferencebench.ai/

2,042

Florian Brand

@xeophon

Jun 12

tried re-running the analysis today. fable blocked my requests cause some benchmarks are bio-related 💀

Florian Brand

@xeophon

Jun 11

interesting paper! i was surprised by the claim that private benches saturate as quickly, so i asked diff llms (fable, codex) to analyze and expand the paper. both found mislabeled data, then extended the dataset. but: the results hold! private benches saturate just as fast

2,342

Florian Brand retweeted

Design Arena

@Designarena

Jun 12

Opus 4.8’s hyperfocus on agents may be making it worse at design. Opus 4.8 ranks 23rd overall on single-turn HTML Web Dev, a dramatic regression from Fable (1st), Opus 4.6 (2nd), and Opus 4.7 (3rd). This was particularly surprising as @AnthropicAI models have held the top spots on our leaderboard for months, and typically win more head-to-head matchups than any other model we track. Our analysis points to a potential underlying pattern: Opus 4.8 dramatically regressed in single-turn settings, potentially due to optimizations for multi-turn agents. Concretely, Opus 4.8 shows shorter initial outputs, reduced dependency on outside sources, and deferred layout decisions that earlier Opus models handled upfront.

181

14,608

Florian Brand retweeted

MiniMax (official)

@MiniMax_AI

Jun 12

MiniMax M3, Open-Weight, Now On Hugging Face, with only ~428B parameters and ~23B activated parameters
Weights: huggingface.co/MiniMaxAI/Min…
MiniMax Sparse Attention: huggingface.co/papers/2606.1…

MiniMaxAI/MiniMax-M3 · Hugging Face

MiniMax (official)

@MiniMax_AI

Jun 1

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities
- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero
API: platform.minimax.io
Token Plan: platform.minimax.io/subscrib…
🚀 New! MiniMax Code: code.minimax.io
Weights & Tech Report in ~10 Days

112

327

2,742

622,730

Florian Brand

@xeophon

Jun 12

remember when people used to type their words

1,790

Florian Brand retweeted

Przemek Chojecki | PC

@prz_chojecki

Jun 12

While everyone talks about Mythos vs GPT-5.5, we've tested other near SOTA models on our ErdosBench. Smoke test on 14 problems with 7 models: Kimi K2.6, Gemini 3.1 Pro, GLM 5.1, MiniMax M3, DeepSeek V4 Pro, Nemotron 3 Ultra and Gemma 3 27b. The winner overall is... Kimi K2.6

297

34,134

Florian Brand retweeted

Kimi.ai

@Kimi_Moonshot

Jun 12

🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced!
🔷 Improved coding & agent performance over K2.6: 21.8% on Kimi Code Bench v2, 11.0% on Program Bench, and 31.5% on MLS Bench Lite.
🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6.
🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates.
⚡️ 6x High-Speed Mode coming soon!
🔌 Available today via Kimi API and Kimi Code.
🔗 Kimi Code: kimi.com/code
🔗 API: platform.moonshot.ai

587

1,586

13,306

1,833,636

Florian Brand retweeted

Thorsten Ball

@thorstenball

Jun 12

Day 3 with Fable. Gave a huge prompt to implement a feature across CLI, web server, and another server to both Fable and deep^2 in Amp. deep^2 was done before I went to the gym. It stopped short. Sent another prompt. $20. Fable ran for 1hr40min and cost $350. Results: They both understood the assignment and built the same thing. Maybe that's due to my prompt. Fable's worked on first try. Well done. Deep's looks correct but didn't work on first try. $20 vs. $350. I'm sure I could get deep^2 to make it work and we'd end up at, what, $40? While Fable is now at $457 after I asked some follow-up questions.

508

356,745