Joined May 2009
Arnim Bleier retweeted
Apr 28
this is average spend per session. with each new model release, people are spending more and more
Arnim Bleier retweeted
Tomorrow - The Pragmatic Engineer podcast episode coming with @badlogicgames (creator of Pi) and @mitsuhiko (creator of Flask, early Sentry, founder at Earendil). 2/3 of the Austrian AI mafia!
Arnim Bleier retweeted
Here is a little experiment: an interactive pi tutorial. Make an empty folder, then run this: pi -e git:github.com/earendil-works/pi… And give feedback! Reason: pi works best if you have an onboarding buddy. But if you don't have one, maybe pi can be one for you?
Arnim Bleier retweeted
Replying to @badlogicgames
> Be with our kid, keep our lifestyle, never have our boy cry again because of "work"
Prioritize this above everything else! Great move and wish you guys all the best!
Arnim Bleier retweeted
Unlock trapped coding-agent traces for safe sharing, analysis, and model training via @huggingface 🤗 hub 👇 opentraces.ai/
Arnim Bleier retweeted
People who like sharing agent traces. I've just published all my pi-mono coding agent sessions on @huggingface so you get to laugh at or pwn me! huggingface.co/datasets/badl… I suggest you do the same, see thread below. Let's make this a community effort. Here's pi-share-hf: github.com/badlogic/pi-share… If you are working on tools that help identify PII/sensitive data, get in touch. The better the classification is, the more willing people will be to share their traces.
Arnim Bleier retweeted
we as software engineers are becoming beholden to a handful of well-funded corporations. while they are our "friends" now, that may change due to incentives. i'm very uncomfortable with that. i believe we need to band together as a community and create a public, free to use repository of real-world (coding) agent sessions/traces. I want small labs, startups, and tinkerers to have access to the same data the big folks currently gobble up from all of us. So we, as a community, can do what e.g. Cursor does below, and take back a little bit of control again. Who's with me? cursor.com/blog/real-time-rl…
Arnim Bleier retweeted
You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13
Arnim Bleier retweeted
Okay so, we just found that over 50 papers published at @Neurips 2025 have AI hallucinations. I don't think people realize how bad the slop is right now. It's not just that researchers from @GoogleDeepMind, @Meta, @MIT, @Cambridge_Uni are using AI - they allowed LLMs to generate hallucinations in their papers and didn't notice at all. It's insane that these made it through peer review👇
Arnim Bleier retweeted
Normally, it's: 1) write a paper & submit 3) get reviews (~3 months) 4) revise paper & resubmit 5) wait for response (~3 months) ...what if we could simulate this process in minutes? Could we fix issues? Anticipate misconceptions? Get ideas for new analyses/experiments? 1/
Arnim Bleier retweeted
💥 I’m starting something new inside OpenAI! It’s called OpenAI for Science, and the goal is to build the next great scientific instrument: an AI-powered platform that accelerates scientific discovery.
15 Mar 2025
Scientific work shouldn’t come at the cost of stressful work environments. @DeutscheWelle & @derspiegel investigate abuse at Germany’s #MPG. Just an isolated case?🤔 #OpenScience #ScienceCulture #Abuse youtube.com/watch?v=n5nEd600…

Arnim Bleier retweeted
It's 2025 and most content is still written for humans instead of LLMs. 99.9% of attention is about to be LLM attention, not human attention. E.g. 99% of libraries still have docs that basically render to some pretty .html static pages assuming a human will click through them. In 2025 the docs should be a single your_project.md text file that is intended to go into the context window of an LLM. Repeat for everything.
Arnim Bleier retweeted
📈Out today in @PNASNews!📈 In a large pre-registered experiment (n=25,982), we find evidence that scaling the size of LLMs yields sharply diminishing persuasive returns for static political messages.  🧵:
Arnim Bleier retweeted
Are LLMs biased when they write about political issues? We just released IssueBench – the largest, most realistic benchmark of its kind – to answer this question more robustly than ever before. Long 🧵with spicy results 👇
Arnim Bleier retweeted
Last week we released s1 - our simple recipe for sample-efficient reasoning & test-time scaling. We’re releasing 𝐬𝟏.𝟏 trained on the 𝐬𝐚𝐦𝐞 𝟏𝐊 𝐪𝐮𝐞𝐬𝐭𝐢𝐨𝐧𝐬 but performing much better by using r1 instead of Gemini traces. 60% on AIME25 I. Details in 🧵1/9
DeepSeek r1 is exciting but misses OpenAI’s test-time scaling plot and needs lots of data. We introduce s1 reproducing o1-preview scaling & performance with just 1K samples & a simple test-time intervention. 📜arxiv.org/abs/2501.19393
Arnim Bleier retweeted
New paper: What happens once AIs make humans obsolete? Even without AIs seeking power, we argue that competitive pressures will fully erode human influence and values. gradual-disempowerment.ai/ with @jankulveit @raymondadouglas @AmmannNora @degerturann @DavidSKrueger 🧵
Arnim Bleier retweeted
Big news! We figured out a way to run mybinder.org instances about 5x cheaper, and in a much simpler way. As of today 2i2c.mybinder.org serves about 70% of Binder's sessions, running on a single VM on Hetzner! 2i2c.org/blog/2025/binder-si…

Arnim Bleier retweeted
16 Dec 2024
Clearly someone needs to try this at scale – pick 1000 published scientific papers at random, ask o1 or o1-pro to look for errors, and see what turns up. I'm going to give it a shot. Anyone interested in helping out? (Incidentally, h/t @gibbnicholas for also noticing that o1-pro can spot the math error in the black plastics paper: x.com/gibbnicholas/status/18…)

15 Dec 2024
👀 A 10 page paper caused a panic because of a math error. I was curious if AI would spot the error by just prompting: “carefully check the math in this paper” especially as the info is not in training data. o1 gets it in a single shot. Should AI checks be standard in science?
Arnim Bleier retweeted
📢 #LoveData25 is just around the corner! Again this year, we are offering an overview page that brings together events on research data (#Forschungsdaten) and research data management (#Forschungsdatenmanagement) in compact form. forschungsdaten.info/fdm-im-… #OpenScience #FDM #RDM