Had to add gemini-3-flash-preview to the results. It dominates. Clearly the top model on this benchmark. Hopefully we can get a v2 of this bench out sometime soon.

Fiction.live @ficlive

Jan 30

Long context eval. Huge improvement since last year. The frontier models went from poor to great. An exciting standout is kimi-2.5. It made impressive progress without (presumably) a new architecture, putting up gemini-2.5-pro numbers which we were all impressed by last year.

171

20,444

Fiction.live

Fiction.live @ficlive

Jan 30

247

39,938

Fiction.live

Fiction.live @ficlive

Jan 30

gemini-3-pro-preview improves upon the strong results of gemini-2.5-pro and is now neck and neck with gpt-5.2 on top in the "almost perfect" tier.

646

Fiction.live

Fiction.live @ficlive

Jan 30

claude-opus-4-5 fixed claude's long context performance, it is now good when previously it was a laggard. claude-sonnet-4-5 had a regression compared to sonnet 4… Same tier as grok-4.

886

Fiction.live

Fiction.live @ficlive

Jan 30

Kimi-k2.5 now the Chinese/Open-source leader! Minimax??? gpt-5.2 improves on almost perfection in gpt-5 to now very close to perfect. gpt-5.2-pro did surprisingly poorly.

1,055

Fiction.live

Fiction.live @ficlive

Jan 30

fiction.live/stories/Fiction…

1,220

Sam Paech

Fiction.live retweeted

Sam Paech @sam_paech

12 Oct 2025

Some updates to Spiral Bench: - A more detailed rubric for protective vs delusion-reinforcing behaviours - Responses evaluated by a judge ensemble: sonnet-4.5, gpt-5 & kimi-k2 - New models evaluated: qwen3-235b, glm-4.6, grok-4-fast, mistral-medium-3.1

3,502

Fiction.live

Fiction.live @ficlive

29 Sep 2025

Fiction.LiveBench for Long Context Deep Comprehension adds: deepseek-v3.2-exp [reasoning: high], deepseek-v3.2-exp, nemotron-nano-9b-v2:free, qwen-max, qwen3-next-80b-a3b-instruct.

20,220

Fiction.live

Fiction.live @ficlive

29 Sep 2025

Thoughts: Interesting that we see an improvement for deepseek's reasoning mode but no improvement for the non-reasoning. It has high scores on the easier questions but very low scores on the hard ones. grok-4-fast is fairly close to sonama-sky-alpha while still being free.

1,262

Fiction.live

Fiction.live @ficlive

29 Sep 2025

For more context on this eval: fiction.live/stories/Fiction…

1,129

Fiction.live

Fiction.live @ficlive

12 Sep 2025

Long context tested for Qwen3-next-80b-a3b-thinking. Performs very similarly to qwen3-30b-a3b-thinking-2507 and far behind qwen3-235b-a22b.

28,452

Fiction.live

Fiction.live @ficlive

12 Sep 2025

@Alibaba_Qwen You guys need better evals during development.

1,199

Fiction.live

Fiction.live @ficlive

12 Sep 2025

fiction.live/stories/Fiction…

1,157

Fiction.live

Fiction.live @ficlive

6 Sep 2025

Tested sonoma-sky-alpha and sonoma-dusk-alpha @openrouter on Fiction.liveBench sky is frontier level! If this is an efficient version that can be served cheaply, that's a fantastic outcome with great performance. Very impressed by this.

143

19,216