Dominik Filkus

846 Photos and videos

Tweets

Pinned Tweet

Dominik Filkus

@DominikFilkus

Jun 9

Claude Fable 5 just one-shotted this Super Pang game. I think we have reached the peak with this test and need to find more complex ones for quick game challenges.

0:34

4,413

Dominik Filkus

@DominikFilkus

Jun 14

I just came across an interesting benchmark from @andonlabs called Blueprint-Bench 2. According to the description on the page, it works as follows: "It tests spatial reasoning by asking AI agents to convert apartment photographs into accurate 2D floor plans. Each agent processes 50 apartments sequentially, examining around 20 interior photos per apartment and generating a floor plan that shows room layouts, connections, and relative sizes." What I find particularly interesting is that Fable actually won this benchmark, while other Claude models performed well below SOTA models such as GPT-5.5 and even GPT-5.4, not to mention the Gemini models.

140

Dominik Filkus

@DominikFilkus

Jun 13

Peekaboo! There was Fable, there is no Fable!

Anthropic

@AnthropicAI

Jun 13

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: anthropic.com/news/fable-myt…

Dominik Filkus retweeted

Artificial Analysis

@ArtificialAnlys

Jun 12

We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark - the swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly released Claude Fable 5 (max) in Claude Code debuts at the top.

DeepSWE, built by @datacurve, writes its tasks from scratch rather than adapting them from public GitHub issues or pull requests, so no model has seen the solutions during training. That matters because SWE-Bench Pro, the benchmark it replaces in our Coding Agent Index, had grown gameable, with some models recovering the fix from the repository's commit history instead of solving the task.

The swap reorders the index: Codex with GPT-5.5 (xhigh) rises from 65 to 76, overtaking Claude Code with Opus 4.8 (max) at 73. Claude Code with Fable 5 (max), which enters directly on the refreshed index, leads at 77. SWE-Bench Pro had been flattering some combinations and penalizing others. More below.

107

185

1,903

538,335

Dominik Filkus

@DominikFilkus

Jun 1

Even @t3dotchat has UX problems. This one loads forever. I do not know what is happening nowadays, but it seems everything is broken. Vibe coding?

165

Dominik Filkus

@DominikFilkus

May 31

The @GoogleDeepMind Gemini Omni experience is so garbage that I can't even find words for it. Gemini app might be fine, but Flow is unusable.

267

Dominik Filkus

@DominikFilkus

May 31

Generations can get stuck forever too. Oh, and Google Flow can't even work with mp3 files that are generated directly with Google Lyria 🤷‍♂️

121

Dominik Filkus retweeted

Claude

@claudeai

May 28

Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. Available today at the same price.

Benchmark table showing how Claude Opus 4.8 compares to its predecessor and to other models on tests of coding, agentic skills, reasoning, and practical knowledge work tasks.

3,687

8,628

67,437

15,240,629

Dominik Filkus retweeted

Serena Ge (Datacurve)

@serenaa_ge

May 26

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

511

742

6,053

1,951,636

Dominik Filkus retweeted

Logan Kilpatrick

@OfficialLoganK

May 19

Welcome to Gemini 3.5 Flash, our most powerful model to date. It pushes the frontier of intelligence, speed, and cost, putting 3.5 Flash in a class of its own. We spent the last 6 months making sure Flash is great for real-world use cases. It's available everywhere now!

469

736

7,364

666,935

Dominik Filkus retweeted

Andrej Karpathy

@karpathy

May 19

Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

7,989

11,150

150,232

27,570,655

Dominik Filkus

@DominikFilkus

May 16

This one looks awesome imo. Mixing styles with Krea 2 is easy and fun at the same time.

128

Dominik Filkus retweeted

Thinking Machines

@thinkymachines

May 11

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/int…

2:15

464

1,958

15,785

7,749,182

Dominik Filkus

@DominikFilkus

May 16

I've been playing around a bit with @krea_ai Moodboards. Here is my pool in the sky.

Dominik Filkus retweeted

Krea

@krea_ai

May 12

this is Krea 2. our first foundation model, built completely from scratch for aesthetic diversity and stylistic control. learn more and get early access 👇

1:07

206

209

2,226

2,328,238

Dominik Filkus

@DominikFilkus

May 14

I almost forgot about this. Oh gosh, what a ride it was. There was the DeepSeek panic, the Blackwell production FUD, the smuggling to China FUD, Google TPUs, and of course all the other hyperscalers producing their own chips so they did not need Nvidia at all. Nope, none of them were enough to change the truth but I am always grateful for the discounts. $NVDA

Dominik Filkus

@DominikFilkus

8 Jan 2024

Whenever Nvidia reaches ATH, this song comes to my mind. $NVDA @nvidia

0:15

129

Dominik Filkus

@DominikFilkus

May 12

This is getting worse...

International Cyber Digest

@IntCyberDigest

May 11

‼️🚨 BREAKING: A new npm supply-chain attack uses a dead-man's switch. The payload plants a watcher on your machine that nukes your home directory the second you revoke the GitHub token it stole from you.

The compromise happened today, across 42 official tanstack npm packages, 84 malicious versions in total. tanstack/react-router alone pulls more than 12 million weekly downloads.

The attacker forked TanStack's repository and pushed a single hidden commit. From there, they tricked TanStack's own release system into signing the malicious packages as if they were the real thing. To npm, and to anyone checking the cryptographic proof of origin (SLSA provenance), the poisoned versions looked 100% legitimate.

Maintainer Tanner Linsley confirmed the whole team had 2FA enabled. It didn't matter.

This is the first documented npm worm in history that ships with a valid, signed certificate of authenticity, the same one defenders rely on to know a package wasn't tampered with.

107

Dominik Filkus

@DominikFilkus

May 10

Who could be behind this archaeopteryx codenamed model? Every time I try to generate an image on Arena AI, this comes to my screen and I think the quality is quite good.

139

Dominik Filkus retweeted

Figure

@Figure_robot

May 8

We taught two F.03 robots to clean a room and make a bed in under 2 minutes - fully autonomous.

2:14

669

1,115

8,346

1,388,029

Dominik Filkus

@DominikFilkus

May 7

Ok so our extended limits totally depend on Elon's actual mood. 🤷‍♂️