Joined September 2020
18 Photos and videos
Pinned Tweet
6 Jun 2024
When LLMs are unsure, they either hallucinate or abstain. Ideally, they should clearly express truthful confidence levels. Our #ICML2024 work designs an alignment objective to achieve this notion of linguistic calibration in *long-form generations*. arxiv.org/abs/2404.00474 🧵
10
45
305
73,396
Neil Band retweeted
What if diffusion models could think ahead instead of being greedy at every step?🤔 We introduce: Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
1
8
36
3,469
Neil Band retweeted
Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason? Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for efficiency, we rely heavily on hand-designed approaches. 🗑️ Introducing Neural Garbage Collection (NGC): we train a language model to jointly reason and manage its own KV cache, using reinforcement learning with outcome-based task reward alone. No SFT, no proxy objectives, no summarization in natural language. New paper with @jubayer_hamid, Emily Fox, and @noahdgoodman!
30
132
906
164,847
Neil Band retweeted
A synthetic data generation method that, when a model is trained on the generated data, it maximizes a certain differentiable objective. e.g. it is possible to make data that engraves a QR code in the weights of an LM head. (Or, more conventional things like translating documents to improve target language loss.)
1
35
304
59,992
Neil Band retweeted
New paper! Want to precisely optimize synthetic training data to do practical or even wacky things? Dataset Policy Gradients get you there, letting you target any differentiable training or post-training metric. We embedded a QR code in GPT-2’s weights using only training data!
7
41
225
47,304
Neil Band retweeted
Our new paper, “End-to-End Test-Time Training for Long Context,” is a step towards continual learning in language models. We introduce a new method that blurs the boundary between training and inference. At test-time, our model continues learning from given context using the same next-token prediction objective as training. With this end-to-end objective, our model can efficiently compress substantial context into its weights and still use it effectively, unlocking extremely long context windows for complex reasoning and applications in agents and robotics. Paper: test-time-training.github.io… Code: github.com/test-time-trainin…
42
208
1,162
185,429
3 Dec 2025
Tim's an amazing researcher and mentor, go work with him!
I'm so happy to share that I’ll be joining @UofT as an Assistant Professor of Statistical Sciences and Computer Science, with an appointment at the @VectorInst, in 2026! I'm recruiting postdocs and PhD students: timrudner.com! Please help me spread the word! 🧵(1/5)
1
330
Neil Band retweeted
Data centers dominate AI, but they're hitting physical limits. What if the future of AI isn't just bigger data centers, but local intelligence in our hands? The viability of local AI depends on intelligence efficiency. To measure this, we propose intelligence per watt (IPW): intelligence delivered (capabilities) per unit of power consumed (efficiency). Today’s Local LMs already handle 88.7% of single-turn chat and reasoning queries, with local IPW improving 5.3× in 2 years—driven by better models (3.2×) and better accelerators (1.7×). As local IPW improves, a meaningful fraction of workloads can shift from centralized infrastructure to local compute, with IPW serving as the critical metric for tracking this transition. (1/N)
56
142
463
229,408
Neil Band retweeted
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute We find simple recipes that improve the asymptote of compute scaling laws to be 5x data efficient, offering better perf w/ sufficient compute
10
84
446
153,443
Neil Band retweeted
4 Sep 2025
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!
13
97
446
183,956
Neil Band retweeted
Can AI solve open problems in math, physics, coding, medical sciences & beyond? We collected unsolved questions (UQ) & tested frontier LLMs. Some solutions passed expert validation…
27
140
496
86,161
Neil Band retweeted
30 Jun 2025
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
12
182
633
152,886
Neil Band retweeted
25 Jun 2025
Very cool thread about the CS336 Language Models from Scratch course at Stanford taught by @percyliang et al. Makes me wish I was a student again!
18 Jun 2025
Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:
19
81
962
112,207
Neil Band retweeted
How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning models like Llama 3.3 70B Instruct! 🧵(1 / N)
11
60
224
81,719
Neil Band retweeted
18 Jun 2025
Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:
46
570
4,922
679,080
Neil Band retweeted
Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data scales. Full details are in our ✨new paper✨ - below we share the highlights: BTW, it also works on non-Qwen models😉 (1/N)
34
192
923
200,643
Neil Band retweeted
Designed some graphics for Stanford CS336 (Language Modeling from Scratch) by @percyliang @tatsu_hashimoto @marcelroed @neilbband @rckpudi Covering four assignments 📚 that teach you how to 🧑‍🍳 cook an LLM from scratch: - Build and Train a Tokenizer 🔤 - Write Triton kernels for Attention ⚡️ - Construct Scaling Laws 📉 - Implement GRPO 🐙
11
59
645
68,391
Neil Band retweeted
Synthetic Continued Pretraining (arxiv.org/pdf/2409.07431) has been accepted as an Oral Presentation at #ICLR2025! We tackle the challenge of data-efficient language model pretraining: how to teach an LM the knowledge of small, niche corpora, such as the latest arXiv preprints.
Grab your favorite preprint of the week: how can you put its knowledge in your LM’s parameters? Continued pretraining (CPT) works well with >10B tokens, but the preprint is <10K. Synthetic CPT downscales CPT to such small, targeted domains. 📜: arxiv.org/abs/2409.07431 🧵👇
1
12
83
11,336
Neil Band retweeted
I think CS336 has one of the best LLM problem sets of any AI/LM class thanks to our incredible TAs (@nelsonfliu,@GabrielPoesia,@marcelroed,@neilbband,@rckpudi). We're making it so you can do it all at home, and it's one of the best ways to learn LLMs deeply.
Want to learn the engineering details of building state-of-the-art Large Language Models (LLMs)? Not finding much info in @OpenAI’s non-technical reports? @percyliang and @tatsu_hashimoto are here to help with CS336: Language Modeling from Scratch, now rolling out to YouTube.
10
59
719
82,236
Neil Band retweeted
Want to learn the engineering details of building state-of-the-art Large Language Models (LLMs)? Not finding much info in @OpenAI’s non-technical reports? @percyliang and @tatsu_hashimoto are here to help with CS336: Language Modeling from Scratch, now rolling out to YouTube.
9
157
1,112
201,320
Neil Band retweeted
3 Apr 2025
Turns out, it’s possible to outperform DeepSeekR1-32B with only SFT on open data and no RL: Announcing OpenThinker2-32B and OpenThinker2-7B. We also release the data, OpenThoughts2-1M, curated by selecting quality instructions from diverse sources. 🧵 (1/n)
19
129
466
89,705