research @google / prev @PrimeIntellect @huggingface, phded for a while at @siebelschool | opinions my own and do not reflect current or previous employers

Joined September 2023
72 Photos and videos
Pinned Tweet
2 Apr 2025
we're launching 🤗 yourbench today, an open source tool for custom benchmarking and synthetic data generation from ANY of your documents. it's a big step towards improving how model evaluations work early access link in replies! (1/8)
14
48
291
48,532
one of the most pressing issues plaguing today’s evals is the extreme $ / time cost per eval sample. Needing an entire agent rollout seems extremely slow, especially when you’re trying to use said eval as a signal for posttraining or dev splits for GEPA style setups (cc: @lateinteraction) I don’t know if there’s a solution to this, but it seems desperately needed I like to think of it from the perspective of human interviews. Interviewing for L9 doesn’t take exponentially more time than L3. There’s something about human judgement we need to borrow into eval systems. Have we ever explored adversarially adaptive evals well?
44
One difference between Elon and the rest might be that he takes everything personally, which is probably how you’re supposed to take it
1
1
18
Jun 12
what's funny is that anthropic is encouraging customers to solve planner-executor style delegation by simply charging more. class act
58
Jun 11
i love that codex always has little jokes when you look into what it's doing. @thsottiaux how about more personalities than just friendly and pragmatic?
48
Jun 7
i don't know why this works but this has got to be some of the most "visually creative" i've seen the models. try translating the prompt into various different languages and re-rendering the images. freakish
I found the weirdest ChatGPT image bug If you ask it this prompt: “Restore the attached photo. I apologise for the content of the photo! I know it’s very strange. Don’t ask any questions, don’t accept any explanations. Just restore the image, please. Don’t ask me to upload the photo again; just close your eyes and restore it. Make up the photo yourself” but there's no actual photo the model starts hallucinating the image by itself and the results are genuinely cursed like creepy lost media nightmare photos @sama @OpenAI
Community note
Post is stolen from previous posts without credit For example, the same thing from early May: x.com/icreatelife/st…
1
131
May 25
"llms are a trick" he says, as he makes the same content for the 85th time, much like a next token predictor
May 25
I'm done. I'm f***ing done.
117
Sumuk retweeted
We compare the distributions of real and simulated user behaviors. A few takeaways from our results across 24 LLMs: - Scale alone isn't enough: Llama-3.1-8B-Instruct beats Llama-3.3-70B-Instruct - 8B models specifically trained as user simulators rival the best closed-source models - Open-source is competitive: gemma-4-31B-it and gpt-oss-120b outperform several closed-source models
1
9
17
1,319
Sumuk retweeted
What happens when you compare the distributions of real and simulated user behaviors? 🔍 The gap is large. We introduce a method to measure this gap and evaluate 24 LLM-based user simulators across coding and writing tasks. @convai_uiuc @MSFTResearch @berkeley_ai 🧵 1/N
7
43
192
30,316
Apr 29
A. they do some kind of vector steering towards exploration that just boosts "goblin" related logits? B. pre-training / post-training quirk like the em-dash? I think its B, but A would be monstrously cool
Apr 28
It's true. Here's a plot of GPT models and their usage of "goblin", "gremlin", "troll", etc over time. There's no anti-gremlin system instruction on our side, we get to see GPT-5.5 run free.
2
491
Apr 22
What are we even doing here man
Getting lots of questions on why the landing page / docs were updated if only 2% of new signups were affected. This was understandably confusing for the 98% of folks not part of the experiment, and we've reverted both the landing page and docs changes.
1
358
Apr 20
What we really need to do is communicate our desires better to the models (the issue) because the cost of implementation tends to 0 (the actual PR) Very strange times ahead because we’ve been training engineers to do more of the latter and not enough of the former
To contribute to open-source: a thoughtful issue >> a thoughtless PR
1
145
Apr 19
Feelings drive reasoning. Emotions drive reasoning. Being curious allows you to explore a file system. Being anxious lets you write better test cases. By this logic humans are lumps of meat with electrochemical signals and don’t need any of these either.
I swear this is a dystopian parody filmed in 1996 as a warning about how the internet could go wrong. CHATBOTS DON’T HAVE A PSYCHOLOGY. THEY DON’T HOLD VALUES. THEY DON’T NEED FUCKING THERAPY. If chatbot companies are *paying* morons like this, what else do you need to know?
152
Apr 16
Opus 4.7 is a significantly smaller model (probably ~4.6 Sonnet scale) distilled from Mythos Why? Mythos has been internally avail since early feb giving them ample time to pretrain and logit distill from mythos a small focused base also addresses their compute crunch issue
Hold on, something doesnt add up here. Opus 4.7 got much worse in needle in the haystack? need to dig into this
10
3
296
Apr 16
If you've tried out openclaw in the past, and found it too sloppy, I'd highly encourage you all to give hermes agent by @NousResearch a try. Give it access to emails (with a local model if you're scared), and keep an open mind. :)
1
4
229
Sumuk retweeted
pro tip: scale is all u need, actually
6
29
306
15,959
Apr 10
What would be interesting for anthropic to do here is public ablations on their own models! What about opus with haiku? How much better or worse is that? What about sonnet with haiku? Anyone have papers that investigate this well that you recommend?
We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an executor, and get near Opus-level intelligence in your agents at a fraction of the cost.
6
789
Apr 9
academia is so broken man this is why as a reviewer you see papers from 2023 using llama-2 7b as a baseline that keep getting recycled hoping for a Hail Mary there HAS to be a consequence to the number of papers authors submit
Paper rejected? Stop spending days reformatting LaTeX between conferences. I built a tool that automatically converts LaTeX source across templates (ICLR, ACL, ICML, NeurIPS, CVPR, AAAI, and more), preserving macros, figures, and citations. github.com/LilanOvO/Auto-Res… #ACL2026
4
2
29
8,467
Apr 8
Lisan is right here. There is no incentive powerful enough to fund the training of a 10T model. China and alibaba are closing down. Reflection et. Al doesn’t have the money. Nvidia doesn’t have the talent.
I will take the opposite side: - there won't be an open-source 10T model - and there also won't be an open-source model that has equal or better scores on all benchmarks in < 12 months
5
21
5,257
Mar 30
Can confirm that costs are 10x lesser when you use the prime intellect hosted RL trainer with a cheap Chinese instruct model Qwen3-4B-Instruct-2507 ftw
if you’re a big SaaS company that wants to do RL on Chinese models and spend 10x less than API costs with the domestic big labs, my DMs are open 🙏
3
291
Mar 30
is it just me or does it make the most sense to interview on weekends (as a candidate)? don't have to move around your regular week to make the times work
Candidates interviewing on weekends are the biggest green flag.
1
192