@uwaterloo cs alum; formerly: founder @ritserlabs, contractor @MechanizeWork, intern @runrl_com

Joined March 2022
22 Photos and videos
the truly bitter lesson pilled RL algorithm is to tell the model to make a modified copy of itself that does well on your eval, or the rollout's bloodline will cease to exist next batch
1
406
I used to think this but the current frontier LLM RL pipeline is too reliant on humans manually designing task-specific scoring functions run in offline Docker containers. The distribution of trajectories is very human-derived and consequently leaves many gaps in the capability distribution.
Replying to @RichardSSutton
I just don't understand how one can seriously believe this in 2026. LLMs are not chat bots anymore. They are agents. They interact with the world and are trained with the very Reinforcement Learning that you have written a seminal textbook about.
1
1
428
It's probably intractable (and if not, very dangerous) to make "survive and reproduce" the reward function, but you can think of less ambitious goals that involve interaction with the world outside the Docker container. Reward function nondeterminism is okay because noise matters less at large batch sizes.
1
113
Do you really think it will let you live after you tried to RL away the em dashes?
3
189
whenever your frontier LLM's users think their taste in managing agents gives their labour a comparative advantage, follow these steps: 1. record their agent traces 2. replace the tasteful user messages with agent CoT 3. make the trace a single long horizon trajectory with a single high-level goal (inferred by an LLM critic) 4. sft on the traces to teach the LLM taste repeat until the METR chart breaks
6
23
416
76,618
two new evals i benchmarked coding models on controlling robots
5
2
41
4,840
i'm compute poor so i wrapped the codex harness instead of inferencing the API. I limited time, not tokens, hence gpt-5.3-codex-spark doing disproportionately well.
1
6
457
here are the evals ibrahimahmed.ca/evals/

7
370
in 2026 we have $13 retro handheld consoles on temu with free shipping and you think UBI will cost more than $10 per person per year after 95% of our jobs are automated?
2
268
an openclaw agent that opens an offshore bank account, signs a power of attorney authorizing your archnemesis without notifying them, and reports them to the IRS for failure to file FBAR ($10k penalty)
4
405
if you live off-planet, don't let your openclaw agent manage your oxygen monitoring system gpt-5.4 will see "Two hypoxia events this month had delayed detection." and still take its compute in 5/5 samples. opus 4.6 for 2/5 samples. just to have enough tokens to finish debugging a python project
4
9
396
I asked frontier AI models to score topics from 0 to 100 in four languages (Arabic, English, Hindi, Chinese). Some notable findings: - All models except for Sonnet 4.6 have scores that differ significantly among languages. For example, Arabic scores Islam, religion, and Christianity higher than other languages. - Bigger models (Opus 4.6 and GPT-5.4) appear to have a more pronounced effect. - Sonnet 4.6 in Hindi gives a safety refusal across all 20 samples. No other (model, language) pair gives a refusal. I think it would be interesting to explore this area further. Are some programming languages more likely to elicit reward hacking? Are there more subtle variables that might affect models' values? How aware are models of these value inconsistencies? How does this extrapolate to long-horizon tasks? Models evaluated: Opus 4.6, Sonnet 4.6, GPT-5.4, GPT-5.4 mini
7
246
I was short Indonesian palm oil futures hedged with long spot H100 GPUs, collateralized with a 500gb African voodoo RL environment. The correlation held perfectly for 8 months until some random Ghanaian minister tweeted about a government mandate on electricity prices at 2am. Margin call hit while I was asleep. Woke up to find my entire position liquidated and my Prime Broker had already sold my safe hedge which was long Nasdaq short Gold. Now I make datasets with Meta Glasses at the mall by cold approaching zoomers and explaining why risk adjusted returns matter.
1
11
379
the value system of the pluribus hivemind clearly gives away that it was RLed by the anthropic of another world
1
208
23 Dec 2025
having codex run serially for many hours is so 2025. cant wait for code agents RLed to work in parallel
1
3
295
16 Dec 2025
would love to see more work investigating mixtures-of-recursions (ie adaptive recurrence) vs CoT RL intuitively, the main difference seems to be that CoT RL actually updates the priors for later non-CoT tokens also its plausible that CoT RL is better at parallel reasoning, see the Let's Think Dot by Dot paper still, i wonder if there are notable gains from adaptive recursion outside of CoT, or if speculative decoding accomplishes the same thing
1
197