@MATSprogram 10.0, CS applied math @ BYU

Joined July 2020
Preprint 🧵! How compartmentalized are LLMs? For data in different formats (English/Chinese, Wiki/Q&A), how much transfer occurs? We provide evidence that LLMs can struggle with this sort of transfer, with consequences like sample inefficiency and capacity competition.
We build on existing work showing that frontier performance on all sorts of transfer is more inconsistent than we might hope, especially after learning from trillions of tokens: x.com/NitCal/status/20263003… @NitCal x.com/omerNLP/status/1907058… @omerNLP arxiv.org/abs/2408.10646 @LChoshen

Wanna check how well a model can share knowledge between languages? Of course you do! 🤩 But can you do it without access to the model’s weights? Now you can with ECLeKTic 🤯
Vin Howe retweeted
The big labs are betting RL will unlock superhuman coding. But their infrastructure is closed, and OSS tooling doesn't support true online RL—just iterative batch optimization. We're releasing ARES to close that gap 🧵
Announcing ARES - our open-source Agentic Research and Evaluation Suite. ARES is built around 3 pillars (👇 see the thread) to make reinforcement learning for code agents easy. We’ve also found it to be incredibly useful for our own mech interp research.
21 Nov 2025
Train a language model in your browser with WebGPU! I built a playground for training sequence models (Transformers, LSTMs, GRUs, vanilla RNNs) completely in your browser on synthetic tasks like sorting and simple natural language datasets like TinyStories. You can fiddle with 50 experiment knobs to build your own model, which can be as big as you have the VRAM to accommodate. You don't have to install anything; all you need is a browser with WebGPU support. Check it out! Links to the repo, blog post, features, and technical details are in the reply. 🧵
21 Nov 2025
This project was inspired directly by:
- @fleetwood___ Ratchet
- @willdepue WebGPT
- @dsmilkov, @shancarter TensorFlow Neural Network Playground
- @kellerjordan0 Modded-NanoGPT and Muon
- @xenovacom Transformers.js
- @polodataclub Transformer Explainer
- @brendanbycroft LLM Visualization
- @karpathy ConvNetJS, micrograd, minGPT, llm.c
21 Nov 2025
Thanks to:
- @grantpitt0, who helped create the original idea, provided invaluable feedback, and helped me debug a few cursed numerical bugs.
- @fleetwood___ for help with Ratchet (and pushing me to write a blog post).
- @bgub_ for helpful feedback. 💜
Vin Howe retweeted
19 May 2025
Excited to share what I’ve been working on with @andykonwinski, @Mike_A_Merrill, and @lschmidt3 at Stanford & Laude. Introducing Terminal-Bench! A benchmark and framework to quantify how well AI agents accomplish complex tasks in a terminal environment. We believe the terminal is a particularly powerful tool for agents because it gives them a low-level, text-based interface for operating a computer.
Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr lots of room for improvement! tbench.ai/
Vin Howe retweeted
I'll give $1M to the first open source AI that gets 90% on this sweet new contamination-free version of SWE-bench - kprize.ai
Vin Howe retweeted
20 Mar 2024
My new piece in @HarvardBiz describes our work using AI to perform conflict mediation on social media, and how it inspired a new intervention by NextDoor which resulted in a 15% decrease in toxic content! hbr.org/2024/03/genai-could-…
Vin Howe retweeted
Excited to share our work, "Skill Set Optimization", a continual learning method for LLM actors that:
- Automatically extracts modular subgoals to use as skills
- Reinforces skills using environment reward
- Facilitates skill retrieval based on state

allenai.github.io/sso 🧵