Joined November 2012
39 Photos and videos
qwen-3.5 plus and qwen3-max-thinking and opus 4.6
10
15
142
22,906
Added glm-5 and minimax-m2.5
4
1
21
1,141
Had to add gemini-3-flash-preview to the results. It dominates. Clearly the top model on this benchmark. Hopefully we can get a v2 of this bench out sometime soon.
Long context eval. Huge improvement since last year. The frontier models went from poor to great. An exciting standout is kimi-2.5. It made impressive progress without (presumably) a new architecture, putting up gemini-2.5-pro numbers which we were all impressed by last year.
13
14
171
20,444
Long context eval. Huge improvement since last year. The frontier models went from poor to great. An exciting standout is kimi-2.5. It made impressive progress without (presumably) a new architecture, putting up gemini-2.5-pro numbers which we were all impressed by last year.
13
21
247
39,938
gemini-3-pro-preview improves upon the strong results of gemini-2.5-pro and is now neck and neck with gpt-5.2 on top in the "almost perfect" tier.
1
4
646
claude-opus-4-5 fixed claude's long context performance, it is now good when previously it was a laggard. claude-sonnet-4-5 had a regression compared to sonnet 4… Same tier as grok-4.
3
886
Kimi-k2.5 now the Chinese/Open-source leader! Minimax??? gpt-5.2 improves on almost perfection in gpt-5 to now very close to perfect. gpt-5.2-pro did surprisingly poorly.
5
1,055
Fiction.live retweeted
12 Oct 2025
Some updates to Spiral Bench: - A more detailed rubric for protective vs delusion-reinforcing behaviours - Responses evaluated by a judge ensemble: sonnet-4.5, gpt-5 & kimi-k2 - New models evaluated: qwen3-235b, glm-4.6, grok-4-fast, mistral-medium-3.1
3
3
29
3,502
29 Sep 2025
Fiction.LiveBench for Long Context Deep Comprehension adds: deepseek-v3.2-exp [reasoning: high], deepseek-v3.2-exp, nemotron-nano-9b-v2:free, qwen-max, qwen3-next-80b-a3b-instruct.
18
10
96
20,220
29 Sep 2025
Thoughts: Interesting that we see an improvement for deepseek's reasoning mode but no improvement for the non-reasoning. It has high scores on the easier questions but very low scores on the hard ones. grok-4-fast is fairly close to sonama-sky-alpha while still being free.
3
13
1,262
29 Sep 2025
For more context on this eval: fiction.live/stories/Fiction…

6
1,129
12 Sep 2025
Long context tested for Qwen3-next-80b-a3b-thinking. Performs very similarly to qwen3-30b-a3b-thinking-2507 and far behind qwen3-235b-a22b.
8
6
67
28,452
12 Sep 2025
@Alibaba_Qwen You guys need better evals during development.
1
13
1,199
Tested sonoma-sky-alpha and sonoma-dusk-alpha @openrouter on Fiction.liveBench sky is frontier level! If this is an efficient version that can be served cheaply, that's a fantastic outcome with great performance. Very impressed by this.
10
15
143
19,216
Added results for kimi-k2-0905 and qwen3-30b-a3b-thinking-2507
15
13
184
12,608
Also I never mentioned this but I think it's pretty clear from these benchmarks that gpt-5 is smaller than 4o.
3
1
22
1,537