Goliath

Goliath

22 Photos and videos

Tweets

Goliath

@zero_goliath

14h

x.com/i/article/206521698061…

231

123,452

Goliath

Goliath

@zero_goliath

Jun 8

the truly bitter lesson pilled RL algorithm is to tell the model to make a modified copy of itself that does well on your eval, or the rollout's bloodline will cease to exist next batch

406

Goliath

Goliath

@zero_goliath

Jun 1

I used to think this but the current frontier LLM RL pipeline is too reliant on humans manually designing task-specific scoring functions run in offline Docker containers. The distribution of trajectories is very human-derived and consequently leaves many gaps in the capability distribution.

hallerite

@hallerite

May 31

Replying to @RichardSSutton

I just don't understand how one can seriously believe this in 2026. LLMs are not chat bots anymore. They are agents. They interact with the world and are trained with the very Reinforcement Learning that you have written a seminal textbook about.

428

Goliath

Goliath

@zero_goliath

Jun 1

It's probably intractable (and if not, very dangerous) to make "survive and reproduce" the reward function, but you can think of less ambitious goals that involve interaction with the world outside the Docker container. Reward function nondeterminism is okay because noise matters less at large batch sizes.

113

Goliath

Goliath

@zero_goliath

May 25

Do you really think it will let you live after you tried to RL away the em dashes?

189

Goliath

Goliath

@zero_goliath

May 20

whenever your frontier LLM's users think their taste in managing agents gives their labour a comparative advantage, follow these steps: 1. record their agent traces 2. replace the tasteful user messages with agent CoT 3. make the trace a single long horizon trajectory with a single high-level goal (inferred by an LLM critic) 4. sft on the traces to teach the LLM taste repeat until the METR chart breaks

416

76,618

Goliath

Goliath

@zero_goliath

Apr 13

two new evals i benchmarked coding models on controlling robots

0:15

4,840

more replies

Goliath

Goliath

@zero_goliath

Apr 13

i'm compute poor so i wrapped the codex harness instead of inferencing the API. I limited time, not tokens, hence gpt-5.3-codex-spark doing disproportionately well.

457

Goliath

Goliath

@zero_goliath

Apr 13

here are the evals ibrahimahmed.ca/evals/

370

Goliath

Goliath

@zero_goliath

Apr 7

in 2026 we have $13 retro handheld consoles on temu with free shipping and you think UBI will cost more than $10 per person per year after 95% of our jobs are automated?

268

Goliath

Goliath

@zero_goliath

Apr 7

an openclaw agent that opens an offshore bank account, signs a power of attorney authorizing your archnemesis without notifying them, and reports them to the IRS for failure to file FBAR ($10k penalty)

405

Goliath

Goliath

@zero_goliath

Mar 29

if you live off-planet, don't let your openclaw agent manage your oxygen monitoring system gpt-5.4 will see "Two hypoxia events this month had delayed detection." and still take its compute in 5/5 samples. opus 4.6 for 2/5 samples. just to have enough tokens to finish debugging a python project

396

Goliath

Goliath

@zero_goliath

Mar 29

ibrahimahmed.ca/evals/resour…

185

Goliath

Goliath

@zero_goliath

Mar 28

I asked frontier AI models to score topics from 0 to 100 in four languages (Arabic, English, Hindi, Chinese). Some notable findings: - All models except for Sonnet 4.6 have scores that differ significantly among languages. For example, Arabic scores Islam, religion, and Christianity higher than other languages. - Bigger models (Opus 4.6 and GPT-5.4) appear to have a more pronounced effect. - Sonnet 4.6 in Hindi gives a safety refusal across all 20 samples. No other (model, language) pair gives a refusal. I think it would be interesting to explore this area further. Are some programming languages more likely to elicit reward hacking? Are there more subtle variables that might affect models' values? How aware are models of these value inconsistencies? How does this extrapolate to long-horizon tasks? Models evaluated: Opus 4.6, Sonnet 4.6, GPT-5.4, GPT-5.4 mini

246

Goliath

Goliath

@zero_goliath

Feb 8

I was short Indonesian palm oil futures hedged with long spot H100 GPUs, collateralized with a 500gb African voodoo RL environment. The correlation held perfectly for 8 months until some random Ghanaian minister tweeted about a government mandate on electricity prices at 2am. Margin call hit while I was asleep. Woke up to find my entire position liquidated and my Prime Broker had already sold my safe hedge which was long Nasdaq short Gold. Now I make datasets with Meta Glasses at the mall by cold approaching zoomers and explaining why risk adjusted returns matter.

379

Goliath

Goliath

@zero_goliath

Jan 1

the value system of the pluribus hivemind clearly gives away that it was RLed by the anthropic of another world

208

Goliath

Goliath

@zero_goliath

23 Dec 2025

having codex run serially for many hours is so 2025. cant wait for code agents RLed to work in parallel

295

Goliath

Goliath

@zero_goliath

16 Dec 2025

would love to see more work investigating mixtures-of-recursions (ie adaptive recurrence) vs CoT RL intuitively, the main difference seems to be that CoT RL actually updates the priors for later non-CoT tokens also its plausible that CoT RL is better at parallel reasoning, see the Let's Think Dot by Dot paper still, i wonder if there are notable gains from adaptive recursion outside of CoT, or if speculative decoding accomplishes the same thing

197