Building RL gyms @refresh_dev | Prev AI @Uber, CS @UofIllinois | believer in community

Joined February 2022
Pinned Tweet
One year ago I accepted a $1M/year offer from @alexandr_wang to leave Uber to lead Scale's ML data engine optimization team. The next week I reneged on it after @sdianahu gave @erikqu_ and me an offer to join the first-ever YC spring batch. Now we service the same customers that Scale does, and we do it better. Starting a company through YC has been the best thing to ever happen to us. Apply to YC, even if you apply late. Happy to review your application now or in the future, DMs open!
The deadline to apply for the spring batch is tonight at 8pm PT! All you need is an idea: ycombinator.com/apply
Christopher Settles retweeted
Can coding agents stay coherent over a 1 billion token budget? Can they build Slack from scratch? Rewrite a JAX codebase in PyTorch? Build a C compiler in Rust? Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.
Christopher Settles retweeted
Inference Chips for Agent Workflows @sdianahu
Most AI chips are designed for "prompt in, response out." Agents don't work that way. They loop, branch, and hold context across dozens of steps, and current GPUs hit 30–40% utilization as a result. That gap is where purpose-built silicon wins.
Christopher Settles retweeted
Supply Chain 2.0 for Semiconductors @sdianahu
A single advanced AI chip crosses a dozen countries and takes five months to build, managed mostly with spreadsheets and phone calls. Real-time allocation tracking, multi-tier risk monitoring, and export compliance tooling barely exist, which is exactly why this is a startup opportunity and not a feature inside SAP.
Christopher Settles retweeted
Pumped for the @ycombinator @GoogleDeepMind event :)
Christopher Settles retweeted
Apr 23
Introducing GPT-5.5
A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done. Now available in ChatGPT and Codex.
Christopher Settles retweeted
Background Computer Use
Computer Use in Codex has some deep OS-level wizardry. Codex can see/click/type in apps in the background, without taking over your computer, and you can work in parallel. @AriX and team absolutely crushed here. Windows soon.
So excited to share that we're bringing Computer Use to Codex. Computer Use lets Codex see, click, and type into your Mac apps, with its own cursor. It's a magical feeling to have agents using your apps in the background, and still get to use your computer at the same time.
Christopher Settles retweeted
interestingly this led to multiple DMs and intros. the kind of talent required is basically “new grads with high agency”. this is significant bcz it's not a traditional role. you need to be:
> quantitative enough to understand what makes good training data / reward design
> operationally obsessive enough to manage a cluster of contractors
> potentially brand new out of college
and it makes sense: as the domain is this new and moving this fast, adaptation is the key.
One of the most important (or hottest) roles is emerging right now, and it is in huge demand, especially from data / RL env companies doing a billion dollars in ARR. The role is "Strategic Project Lead" (SPL). In this role:
> you are supposed to sit between frontier labs / Fortune 500 and a workforce of domain experts.
> understand the requirements of teams and what the model needs to learn
> design the training data program / RL envs and work across teams.
> own the outcome end-to-end.
Note that it is neither an "ML researcher" nor a "Sales" problem. It is kind of a weird hybrid that requires you to
> understand ML well enough to know what makes good training data / reward design
> understand the business domain well enough to design relevant tasks
> manage hundreds of expert contractors
> operate with enough autonomy to run a multi-million-dollar program
Christopher Settles retweeted
you can now code from your watch. go touch grass. take that walk. leave the laptop behind. @omnaraai for apple watch is live
If you want to learn the skills you’ll need to become a founder, our open roles are below! We’re always looking out for strong engineers who have a knack for staying up to date with LLM training research. 👇
Replying to @never_settles_
@refresh_dev MAFIAAAAA
Go Adam! Kill it during YC, make the @refresh_dev mafia!
I grew up at my dad's medical practice. Quickly, I realized that clinicians just want to treat patients, not deal with administrative tasks. So, my cofounder @nandaguntupalli and I are now building Taiga, a full stack medical billing service for independent practices. We handle coding, claim submission, and denials so providers can focus on patients. We’re already working with practices and helping them resolve their trickiest claims. I’ll be at Pri-Med West in Anaheim later this week. If you run or work at a small practice, I’d love to buy you coffee and learn about your billing workflow :) usetaiga.com
Congrats to our previous interns @nandaguntupalli and @AdamWax3 on entering the YC P26 batch! When Erik and I met Nanda we were impressed with his drive to learn; he convinced us to let him work with us on the first call. So excited to see what you guys will accomplish!
Today we're announcing that Taiga (usetaiga.com) got into the YC Spring 26 batch. Growing up, I spent a lot of time at my mom's medical practice, helping out around the office. I saw firsthand how much overhead goes into getting doctors paid. Insurance calls, denials, resubmissions, hours that should have gone to patients. That's why @AdamWax3 and I are building Taiga: a full stack billing service for independent practices. We're already helping clinicians increase revenue and cut denials. If billing is eating your week, reach out. calendar.google.com/calendar…
Christopher Settles retweeted
SFT for computer use saturates after 100–1000 examples. RL doesn't. 0.39 → 0.53 on aggregated UI benchmark. 20% absolute on OS-World Chrome - trained on environments with no resemblance to OS-World.
Christopher Settles retweeted
Mum I made it on TBPN holy shit
Mar 25
BREAKING: @ivanleomk is joining Google DeepMind
Christopher Settles retweeted
In 2023, WebArena took 7 grad students more than 6 months to build just 5 environments with 812 variable browser-use tasks. Now, it takes under 10 hours and less than $100 per environment, with easy support for parallel generation. Excited to introduce WebArena-Infinity: a scalable approach for automatically generating high-authenticity, high-complexity browser environments with verifiable tasks suitable for RL training and benchmarking. Even strong open-source models that already achieve 60% success rates on WebArena and OSWorld complete fewer than 50% of tasks here. Project page: webarena.dev/webarena-infini… Repo: github.com/web-arena-x/webar… 🧵 (1/n)
Christopher Settles retweeted
Here are a few of our favorite shots from our recent out-of-home campaign. Loving how the Arcee teal cuts right through the noise of downtown SF and the traffic on the 101, plus a bonus shot from the DC metro.
Christopher Settles retweeted
Our first frontier-level model! It's the result of our first continued pretraining run as well as further scaling RL. Very excited to hear how people like it! Feel free to send me feedback and we'll incorporate it into future models.
Composer 2 is now available in Cursor.
Christopher Settles retweeted
We're open sourcing Northstar CUA Fast, a frontier 4B open-source Computer Use Action (CUA) model, built for accuracy and long-horizon action planning.