Perplexity just open-sourced the tool they use internally to cut their own CPU usage by 5-6x. 🤯
It's a rebuilt tokenizer called pplx-unigram. Before any AI model can read your text, something has to chop that text into small pieces first.
That chopping runs on the CPU, not the GPU where the model actually lives.
pplx-unigram covers the search, ranking, and retrieval models that power most AI apps today.
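For a feel of what that "chopping" involves, here is a minimal sketch of Unigram-style tokenization in Python. It is a toy illustration, not pplx-unigram's code: the vocabulary, its probabilities, and the function name `unigram_segment` are all made up for this example. The idea is to pick the split of the text whose pieces have the highest combined probability, using a Viterbi-style dynamic program.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability.
# These numbers are invented for illustration, not taken from any real model.
TOY_VOCAB = {
    "token": math.log(0.10),
    "izer": math.log(0.05),
    "tok": math.log(0.02),
    "en": math.log(0.04),
}
# Single-character fallbacks so every position can be covered.
for ch in "tokenizr":
    TOY_VOCAB.setdefault(ch, math.log(0.001))


def unigram_segment(text, vocab):
    """Return the highest-scoring split of `text` into vocabulary pieces."""
    n = len(text)
    max_len = max(len(piece) for piece in vocab)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == -math.inf:
                continue
            piece = text[start:end]  # naive: allocates a new string each time
            if piece in vocab:
                score = best[start] + vocab[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be covered by the vocabulary")

    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


print(unigram_segment("tokenizer", TOY_VOCAB))
```

With this toy vocabulary, "tokenizer" splits into "token" + "izer". Note that this naive version builds a new substring for every candidate piece. That is only meant to show what per-request throwaway work can look like in general; it is not a claim about how any particular library is implemented.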
Here is why this matters now.
The small models behind search and ranking have gotten so fast on GPUs that they now finish in single-digit milliseconds.
So the boring step before them, the text-chopping, quietly became a real chunk of the total time. Nobody was looking at it because everyone was busy making the models faster.
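A quick back-of-the-envelope sketch of why that share matters. The millisecond figures below are hypothetical, chosen only to illustrate the arithmetic; the 5x factor is the speedup quoted in this post, and `tokenization_share` is a helper name made up for the example.

```python
def tokenization_share(tokenize_ms, model_ms):
    """Fraction of end-to-end latency spent tokenizing."""
    return tokenize_ms / (tokenize_ms + model_ms)


# Hypothetical numbers, for illustration only.
MODEL_MS = 5.0      # assumed GPU time for a small reranker
TOKENIZE_MS = 2.0   # assumed CPU time for tokenization
SPEEDUP = 5.0       # speedup factor quoted in the post

before = tokenization_share(TOKENIZE_MS, MODEL_MS)
after = tokenization_share(TOKENIZE_MS / SPEEDUP, MODEL_MS)

print(f"share before: {before:.1%}")
print(f"share after:  {after:.1%}")
```

Under these assumed numbers, tokenization drops from roughly 29% of total latency to roughly 7%. The faster the model gets, the larger the "before" share becomes, which is the point the post is making.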
Perplexity looked.
They found the standard tool almost everyone uses was wasting effort on every single request, creating throwaway data and chasing scattered memory.
So they rebuilt it from scratch.
Everyone optimizes the model. Perplexity optimized the step before it.
The result: 5x faster than the HuggingFace tokenizer almost everyone runs, 2x faster than the standard C implementation, and 5-6x less CPU in their own production stack.
Same exact output. MIT licensed. Free.
For years the tokenizer was treated as a solved problem nobody needed to touch.
Perplexity just proved it was hiding a 5x speedup. In the open.
Worth a look if you run any search, ranking, or retrieval models at scale.
We're open-sourcing the Unigram tokenizer we rebuilt to reduce CPU utilization by 5-6x.
Small rerankers and embedders run in single-digit milliseconds on GPU, making CPU tokenization a meaningful share of total latency.
github.com/perplexityai/pplx…