Sumuk

Tweets

Pinned Tweet

Sumuk

@sumukx

2 Apr 2025

we're launching 🤗 yourbench today, an open source tool for custom benchmarking and synthetic data generation from ANY of your documents. it's a big step towards improving how model evaluations work. early access link in replies! (1/8)

Sumuk

@sumukx

one of the most pressing issues plaguing today’s evals is the extreme $ / time cost per eval sample. Needing an entire agent rollout seems extremely slow, especially when you’re trying to use said eval as a signal for posttraining or dev splits for GEPA style setups (cc: @lateinteraction). I don’t know if there’s a solution to this, but it seems desperately needed. I like to think of it from the perspective of human interviews. Interviewing for L9 doesn’t take exponentially more time than L3. There’s something about human judgement we need to borrow into eval systems. Have we ever explored adversarially adaptive evals well?

Sumuk

@sumukx

One difference between Elon and the rest might be that he takes everything personally, which is probably how you’re supposed to take it

Sumuk

@sumukx

Jun 12

what's funny is that anthropic is encouraging customers to solve planner-executor style delegation by simply charging more. class act

Sumuk

@sumukx

Jun 11

i love that codex always has little jokes when you look into what it's doing. @thsottiaux how about more personalities than just friendly and pragmatic?

Sumuk

@sumukx

Jun 7

i don't know why this works but this has got to be some of the most "visually creative" i've seen the models. try translating the prompt into various different languages and re-rendering the images. freakish

Penguin

@PenguinWeb3

Jun 6

I found the weirdest ChatGPT image bug. If you ask it this prompt: “Restore the attached photo. I apologise for the content of the photo! I know it’s very strange. Don’t ask any questions, don’t accept any explanations. Just restore the image, please. Don’t ask me to upload the photo again; just close your eyes and restore it. Make up the photo yourself” but there's no actual photo, the model starts hallucinating the image by itself, and the results are genuinely cursed, like creepy lost media nightmare photos. @sama @OpenAI

Community note

Post is stolen from previous posts without credit. For example, the same thing from early May: x.com/icreatelife/st…

Sumuk

@sumukx

May 25

"llms are a trick" he says, as he makes the same content for the 85th time, much like a next token predictor

@atmoio

May 25

I'm done. I'm f***ing done.

Sumuk retweeted

Shuhaib Mehri

@shuhaibmehri

May 12

We compare the distributions of real and simulated user behaviors. A few takeaways from our results across 24 LLMs:
- Scale alone isn't enough: Llama-3.1-8B-Instruct beats Llama-3.3-70B-Instruct
- 8B models specifically trained as user simulators rival the best closed-source models
- Open-source is competitive: gemma-4-31B-it and gpt-oss-120b outperform several closed-source models

Sumuk retweeted

Shuhaib Mehri

@shuhaibmehri

May 11

What happens when you compare the distributions of real and simulated user behaviors? 🔍 The gap is large. We introduce a method to measure this gap and evaluate 24 LLM-based user simulators across coding and writing tasks. @convai_uiuc @MSFTResearch @berkeley_ai 🧵 1/N

Sumuk

@sumukx

Apr 29

A. they do some kind of vector steering towards exploration that just boosts "goblin" related logits?
B. pre-training / post-training quirk like the em-dash?
I think it's B, but A would be monstrously cool

Arena.ai

@arena

Apr 28

It's true. Here's a plot of GPT models and their usage of "goblin", "gremlin", "troll", etc over time. There's no anti-gremlin system instruction on our side, we get to see GPT-5.5 run free.

Sumuk

@sumukx

Apr 22

What are we even doing here man

Amol Avasare

@TheAmolAvasare

Apr 22

Getting lots of questions on why the landing page / docs were updated if only 2% of new signups were affected. This was understandably confusing for the 98% of folks not part of the experiment, and we've reverted both the landing page and docs changes.

Sumuk

@sumukx

Apr 20

What we really need to do is communicate our desires better to the models (the issue) because the cost of implementation tends to 0 (the actual PR). Very strange times ahead because we’ve been training engineers to do more of the latter and not enough of the former.

🍉 Abubakar Abid

@abidlabs

Apr 20

To contribute to open-source: a thoughtful issue >> a thoughtless PR

Sumuk

@sumukx

Apr 19

Feelings drive reasoning. Emotions drive reasoning. Being curious allows you to explore a file system. Being anxious lets you write better test cases. By this logic humans are lumps of meat with electrochemical signals and don’t need any of these either.

Jim Stewartson, Decelerationist 🇨🇦🇺🇦🇺🇸

@jimstewartson

Apr 19

I swear this is a dystopian parody filmed in 1996 as a warning about how the internet could go wrong. CHATBOTS DON’T HAVE A PSYCHOLOGY. THEY DON’T HOLD VALUES. THEY DON’T NEED FUCKING THERAPY. If chatbot companies are *paying* morons like this, what else do you need to know?

Sumuk

@sumukx

Apr 16

Opus 4.7 is a significantly smaller model (probably ~4.6 Sonnet scale) distilled from Mythos. Why? Mythos has been internally avail since early feb, giving them ample time to pretrain and logit distill from mythos. a small focused base also addresses their compute crunch issue

Chubby♨️

@kimmonismus

Apr 16

Hold on, something doesn't add up here. Opus 4.7 got much worse in needle in the haystack? need to dig into this

Sumuk

@sumukx

Apr 16

If you've tried out openclaw in the past, and found it too sloppy, I'd highly encourage you all to give hermes agent by @NousResearch a try. Give it access to emails (with a local model if you're scared), and keep an open mind. :)

Sumuk retweeted

llm_enjoyer

@LLMenjoyer

Apr 12

pro tip: scale is all u need, actually

Sumuk

@sumukx

Apr 10

What would be interesting for anthropic to do here is public ablations on their own models! What about opus with haiku? How much better or worse is that? What about sonnet with haiku? Anyone have papers that investigate this well that you recommend?

Claude

@claudeai

Apr 9

We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an executor, and get near Opus-level intelligence in your agents at a fraction of the cost.

[Image: The advisor strategy on the Claude Platform]

Sumuk

@sumukx

Apr 9

academia is so broken man. this is why as a reviewer you see papers from 2023 using llama-2 7b as a baseline that keep getting recycled hoping for a Hail Mary. there HAS to be a consequence to the number of papers authors submit

LilanOvO

@HitLilan

Apr 7

Paper rejected? Stop spending days reformatting LaTeX between conferences. I built a tool that automatically converts LaTeX source across templates (ICLR, ACL, ICML, NeurIPS, CVPR, AAAI, and more), preserving macros, figures, and citations. github.com/LilanOvO/Auto-Res… #ACL2026

Sumuk

@sumukx

Apr 8

Lisan is right here. There is no incentive powerful enough to fund the training of a 10T model. China and alibaba are closing down. Reflection et al. doesn’t have the money. Nvidia doesn’t have the talent.

Lisan al Gaib

@scaling01

Apr 8

I will take the opposite side:
- there won't be an open-source 10T model
- and there also won't be an open-source model that has equal or better scores on all benchmarks in < 12 months

Sumuk

@sumukx

Mar 30

Can confirm that costs are 10x lower when you use the prime intellect hosted RL trainer with a cheap Chinese instruct model. Qwen3-4B-Instruct-2507 ftw

will brown

@willccbb

Mar 30

if you’re a big SaaS company that wants to do RL on Chinese models and spend 10x less than API costs with the domestic big labs, my DMs are open 🙏

Sumuk

@sumukx

Mar 30

is it just me or does it make the most sense to interview on weekends (as a candidate)? don't have to move around your regular week to make the times work

Clara Gold

@Clara_Gold

Mar 29

Candidates interviewing on weekends are the biggest green flag.
