Fresh on arXiv! Our new paper reformulates tokenisation as a linear program (LP), which we solve to get SOTA tokenisers! As a bonus, this LP allows us to know how close to optimal any tokeniser is! Check it out!
In our new paper, we reinterpret tokenisation as a problem in high-dimensional geometry (100M dims to be precise!), which we can solve efficiently to get a globally near-optimal tokeniser! Our method consistently improves language models over BPE. See 🧵 for details.