Tomasz Limisiewicz

Tweets

Pinned Tweet

Tomasz Limisiewicz @TomLimi

May 4

We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping only data and model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]
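
As a rough illustration of what "compression rate (bytes/token)" measures, here is a minimal sketch: total UTF-8 bytes divided by total tokens over a corpus. The two toy tokenizers (`whitespace_tokenize`, `byte_tokenize`) and the sample sentence are assumptions made for illustration; they are not the tokenizers swept in the paper.

```python
from typing import Callable, Iterable, List


def compression_rate(texts: Iterable[str], tokenize: Callable[[str], List]) -> float:
    """Return average bytes per token: total UTF-8 bytes / total tokens."""
    total_bytes = 0
    total_tokens = 0
    for text in texts:
        total_bytes += len(text.encode("utf-8"))
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_bytes / total_tokens


def whitespace_tokenize(text: str) -> List[str]:
    # Toy, lossy tokenizer (assumption): one token per whitespace-separated word.
    return text.split()


def byte_tokenize(text: str) -> List[int]:
    # Toy byte-level tokenizer (assumption): one token per UTF-8 byte.
    return list(text.encode("utf-8"))


sample = ["scaling laws depend on compression"]
print(compression_rate(sample, byte_tokenize))        # exactly 1 byte per token
print(compression_rate(sample, whitespace_tokenize))  # more bytes per token
```

A byte-level tokenizer sits at 1 byte/token by construction; coarser tokenizers pack more bytes into each token, which is the axis the thread describes sweeping.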

Tomasz Limisiewicz retweeted

Alex Nichol @unixpickle

Jun 11

Blog post about my recent optimal tokenizer exploration blog.aqnichol.com/2026/06/10…

Tomasz Limisiewicz retweeted

Conference on Language Modeling @COLM_conf

May 27

COLM 2026 will host 16(!) workshops: colmweb.org/workshops.html CFPs are all online, and deadlines are coming up, so check the CFP of your workshops of interest

Tomasz Limisiewicz @TomLimi

May 26

Happy to share that the unprocessed results and code for fitting scaling laws and plotting are now available at: github.com/facebookresearch/…

The repository contains raw data results and code for scaling laws fitting and visualization used in "Compute Optimal Tokenization" paper. - facebookresearch/compute-optimal-tokenization

github.com

Tomasz Limisiewicz @TomLimi

May 4

Tomasz Limisiewicz retweeted

Tokenization Workshop (TokShop) @COLM2026 @tokshop2025

May 22

Announcing First Call for Papers: Second Tokenization Workshop 🔡 📣
▶️ Non-archival submissions of two types: Research papers (up to 9 pages)
▶️ Extended abstracts (up to 2 pages)
Submission deadline: June 23, 2026 (AoE)
Acceptance notification: July 24, 2026 (AoE)

Tomasz Limisiewicz retweeted

Margaret Li @margs_li

May 18

MoEs are everywhere, but the design space is confusing: total vs active experts? expert size? shared experts? routing? token dropping? We train >2000 MoE LMs 🫠 to investigate and bring you: 📄🔪🍰 Slicing and Dicing MoEs Tl;dr: it's all about expert size and count [1/9]

Tomasz Limisiewicz retweeted

Alisa Liu @alisawuffles

May 14

In SuperBPE we found: as tokenizer compression increases, the compute-optimal ratio of train tokens to model params decreases — and remarkably, corresponds to the same underlying ratio of train *bytes* / param! Our new work makes it official: scaling laws depend on compression.

Tomasz Limisiewicz @TomLimi

May 4
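
A small sketch of the relationship Alisa Liu describes above: if the compute-optimal number of training bytes per parameter stays roughly fixed, then the optimal tokens per parameter falls as the tokenizer's bytes/token rises. The constant `BYTES_PER_PARAM = 80.0` and the compression rates below are hypothetical values picked for easy arithmetic; they are not results from SuperBPE or the Compute Optimal Tokenization paper.

```python
# Hypothetical constant, chosen only so the arithmetic is easy to follow.
BYTES_PER_PARAM = 80.0


def tokens_per_param(bytes_per_param: float, bytes_per_token: float) -> float:
    """Convert a bytes-per-parameter ratio into tokens per parameter."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return bytes_per_param / bytes_per_token


# Hypothetical compression rates (bytes/token) for three imagined tokenizers.
for rate in (4.0, 5.0, 8.0):
    ratio = tokens_per_param(BYTES_PER_PARAM, rate)
    print(f"{rate} bytes/token -> {ratio} tokens/param")
```

Under these assumed numbers, a fixed bytes/param budget yields 20, 16, and 10 tokens per parameter, so a "tokens per parameter" rule of thumb only transfers between tokenizers after converting through bytes.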

Tomasz Limisiewicz @TomLimi

May 14

See you there! 🌉🔠

Tokenization Workshop (TokShop) @COLM2026 @tokshop2025

May 14

TokShop will be at #COLM2026! 🗓️ October 9th, 2026 📍 San Francisco, USA More details and a call for papers coming soon.

Tomasz Limisiewicz retweeted

Grigory Sapunov @che_shr_cat

May 12

1/ The "20 tokens per parameter" Chinchilla scaling law is flawed. It is an artifact of your tokenizer. Scaling shouldn't be measured in tokens at all. It should be measured in bytes. 🧵

Tomasz Limisiewicz @TomLimi

May 11

There is life beyond BPE! 🔠🌱🥪 Don’t miss this amazing work from @JulieKallini tackling one of the key challenges of byte-level LLMs: generation speed. Diffusion and speculative decoding come to the rescue, enabling much faster generation with BLT with similar performance.

Julie Kallini ✨ @JulieKallini

May 11

Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪 Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow. We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/

Tomasz Limisiewicz retweeted

Artidoro Pagnoni @ArtidoroPagnoni

May 4

Tokens are not a universal unit of data. In our new work on Compute Optimal Tokenization, we show that when adapting scaling recipes across tokenizers, bytes are the more stable unit. And the compute-optimal compression rate is not necessarily what today’s BPE tokenizers use.

Tomasz Limisiewicz @TomLimi

May 4

Tomasz Limisiewicz retweeted

Srini Iyer @sriniiyer88

May 4

Extremely excited about our work on Compute Optimal Tokenization! This paper categorically nails down the role that compression plays in compute optimality and recommends how to scale models keeping compression in mind. Cool results on multiple languages too!

Tomasz Limisiewicz @TomLimi

May 4

Tomasz Limisiewicz retweeted

You Jiacheng @YouJiacheng

May 5

larger compute prefers a smaller vocabulary, interesting. 2 follow-up questions: 1. can we decouple in/out tokenization? to isolate the effect of more-input-tokens vs. finer-prediction-granularity. (see also arxiv.org/abs/2504.14992) 2. can we combine it with n-gram embed?

Efficient Pretraining Length Scaling

Recent advances in large language models have demonstrated the effectiveness of length scaling during post-training, yet its potential in pre-training remains underexplored. We present the...

arxiv.org

Tomasz Limisiewicz @TomLimi

May 4

Replying to @TomLimi

These findings hold both for latent tokenizers (BLT) and subword tokenizers (BPE variants). Interestingly, with BPE we observe that, at large scale, decreasing compression by choosing a smaller vocabulary improves performance. [4/N]

Tomasz Limisiewicz @TomLimi

May 4

At the same time, the optimal compression rate varies across languages and can differ substantially from the compression rate of popular BPE tokenizers. [7/N]

Tomasz Limisiewicz @TomLimi

May 4

Find more in the blog post: co-tok.github.io/ And the paper: co-tok.github.io/paper.pdf Huge thanks to my amazing co-authors at @AIatMeta: @ArtidoroPagnoni @sriniiyer88 @ml_perception @sacmehtauw @gargighosh @LukeZettlemoyer and @uwnlp: @margs_li @alisawuffles [8/N]
