Neil Band

Neil Band

18 Photos and videos

Tweets

Pinned Tweet

Neil Band @neilbband

6 Jun 2024

When LLMs are unsure, they either hallucinate or abstain. Ideally, they should clearly express truthful confidence levels. Our #ICML2024 work designs an alignment objective to achieve this notion of linguistic calibration in *long-form generations*. arxiv.org/abs/2404.00474 🧵

305

73,396

Tim G. J. Rudner

Neil Band retweeted

Tim G. J. Rudner

@timrudner

Jun 9

What if diffusion models could think ahead instead of being greedy at every step?🤔 We introduce: Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

3,469

Michael Y. Li

Neil Band retweeted

Michael Y. Li

@michaelyli_

Apr 22

Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason? Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for efficiency, we rely heavily on hand-designed approaches. 🗑️ Introducing Neural Garbage Collection (NGC): we train a language model to jointly reason and manage its own KV cache, using reinforcement learning with outcome-based task reward alone. No SFT, no proxy objectives, no summarization in natural language. New paper with @jubayer_hamid, Emily Fox, and @noahdgoodman!

132

906

164,847

Rosinality

Neil Band retweeted

Rosinality @rosinality

Apr 10

A synthetic data generation method that, when a model is trained on the generated data, it maximizes a certain differentiable objective. e.g. it is possible to make data that engraves a QR code in the weights of an LM head. (Or, more conventional things like translating documents to improve target language loss.)

304

59,992

Tristan Thrush

Neil Band retweeted

Tristan Thrush @TristanThrush

Apr 10

New paper! Want to precisely optimize synthetic training data to do practical or even wacky things? Dataset Policy Gradients get you there, letting you target any differentiable training or post-training metric. We embedded a QR code in GPT-2’s weights using only training data!

225

47,304

Karan Dalal

Neil Band retweeted

Karan Dalal

@karansdalal

29 Dec 2025

Our new paper, “End-to-End Test-Time Training for Long Context,” is a step towards continual learning in language models. We introduce a new method that blurs the boundary between training and inference. At test-time, our model continues learning from given context using the same next-token prediction objective as training. With this end-to-end objective, our model can efficiently compress substantial context into its weights and still use it effectively, unlocking extremely long context windows for complex reasoning and applications in agents and robotics. Paper: test-time-training.github.io… Code: github.com/test-time-trainin…

208

1,162

185,429

Neil Band

Neil Band @neilbband

3 Dec 2025

Tim's an amazing researcher and mentor, go work with him!

Tim G. J. Rudner

@timrudner

2 Dec 2025

I'm so happy to share that I’ll be joining @UofT as an Assistant Professor of Statistical Sciences and Computer Science, with an appointment at the @VectorInst, in 2026! I'm recruiting postdocs and PhD students: timrudner.com! Please help me spread the word! 🧵(1/5)

330

Jon Saad-Falcon

Neil Band retweeted

Jon Saad-Falcon

@JonSaadFalcon

12 Nov 2025

Data centers dominate AI, but they're hitting physical limits. What if the future of AI isn't just bigger data centers, but local intelligence in our hands? The viability of local AI depends on intelligence efficiency. To measure this, we propose intelligence per watt (IPW): intelligence delivered (capabilities) per unit of power consumed (efficiency). Today’s Local LMs already handle 88.7% of single-turn chat and reasoning queries, with local IPW improving 5.3× in 2 years—driven by better models (3.2×) and better accelerators (1.7×). As local IPW improves, a meaningful fraction of workloads can shift from centralized infrastructure to local compute, with IPW serving as the critical metric for tracking this transition. (1/N)

142

463

229,408

Suhas Kotha

Neil Band retweeted

Suhas Kotha @kothasuhas

19 Sep 2025

Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute We find simple recipes that improve the asymptote of compute scaling laws to be 5x data efficient, offering better perf w/ sufficient compute

446

153,443

Kaiyue Wen

Neil Band retweeted

Kaiyue Wen

@wen_kaiyue

4 Sep 2025

(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!

446

183,956

Niklas Muennighoff

Neil Band retweeted

Niklas Muennighoff @Muennighoff

26 Aug 2025

Can AI solve open problems in math, physics, coding, medical sciences & beyond? We collected unsolved questions (UQ) & tested frontier LLMs. Some solutions passed expert validation…

140

496

86,161

CLS

Neil Band retweeted

CLS

@ChengleiSi

30 Jun 2025

Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.

182

633

152,886

Jeff Dean

Neil Band retweeted

Jeff Dean

@JeffDean

25 Jun 2025

Very cool thread about the CS336 Language Models from Scratch course at Stanford taught by @percyliang et al. Makes me wish I was a student again!

Percy Liang

@percyliang

18 Jun 2025

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

962

112,207

Jon Saad-Falcon

Neil Band retweeted

Jon Saad-Falcon

@JonSaadFalcon

24 Jun 2025

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning models like Llama 3.3 70B Instruct! 🧵(1 / N)

224

81,719

Percy Liang

Neil Band retweeted

Percy Liang

@percyliang

18 Jun 2025

570

4,922

679,080

Ryan Marten

Neil Band retweeted

Ryan Marten

@ryanmart3n

5 Jun 2025

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data scales. Full details are in our ✨new paper✨ - below we share the highlights: BTW, it also works on non-Qwen models😉 (1/N)

192

923

200,643

Simon Guo

Neil Band retweeted

Simon Guo

@simonguozirui

4 Jun 2025

Designed some graphics for Stanford CS336 (Language Modeling from Scratch) by @percyliang @tatsu_hashimoto @marcelroed @neilbband @rckpudi Covering four assignments 📚 that teach you how to 🧑‍🍳 cook an LLM from scratch: - Build and Train a Tokenizer 🔤 - Write Triton kernels for Attention ⚡️ - Construct Scaling Laws 📉 - Implement GRPO 🐙

645

68,391

Zitong Yang

Neil Band retweeted

Zitong Yang

@ZitongYang0

21 Apr 2025

Synthetic Continued Pretraining (arxiv.org/pdf/2409.07431) has been accepted as an Oral Presentation at #ICLR2025! We tackle the challenge of data-efficient language model pretraining: how to teach an LM the knowledge of small, niche corpora, such as the latest arXiv preprints.

Zitong Yang

@ZitongYang0

12 Sep 2024

Grab your favorite preprint of the week: how can you put its knowledge in your LM’s parameters? Continued pretraining (CPT) works well with >10B tokens, but the preprint is <10K. Synthetic CPT downscales CPT to such small, targeted domains. 📜: arxiv.org/abs/2409.07431 🧵👇

11,336

Tatsunori Hashimoto

Neil Band retweeted

Tatsunori Hashimoto @tatsu_hashimoto

19 Apr 2025

I think CS336 has one of the best LLM problem sets of any AI/LM class thanks to our incredible TAs (@nelsonfliu,@GabrielPoesia,@marcelroed,@neilbband,@rckpudi). We're making it so you can do it all at home, and it's one of the best ways to learn LLMs deeply.

Stanford NLP Group

@stanfordnlp

19 Apr 2025

Want to learn the engineering details of building state-of-the-art Large Language Models (LLMs)? Not finding much info in @OpenAI’s non-technical reports? @percyliang and @tatsu_hashimoto are here to help with CS336: Language Modeling from Scratch, now rolling out to YouTube.

719

82,236

Stanford NLP Group

Neil Band retweeted

Stanford NLP Group

@stanfordnlp

19 Apr 2025

157

1,112

201,320

Etash Guha

Neil Band retweeted

Etash Guha

@etash_guha

3 Apr 2025

Turns out, it’s possible to outperform DeepSeekR1-32B with only SFT on open data and no RL: Announcing OpenThinker2-32B and OpenThinker2-7B. We also release the data, OpenThoughts2-1M, curated by selecting quality instructions from diverse sources. 🧵 (1/n)

129

466

89,705