It's very cool that Apple shipped a 20B-parameter model on-device.
You can't fit 20B parameters in RAM at any reasonable precision. To make it work, they are using a pretty exotic architecture by today's standards.
A small model predicts from the query (or prompt) which experts to load from NAND into RAM. The key distinction from a typical MoE is that this happens once per query, and all the tokens are then generated with the same experts (instead of switching experts for every token).
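
A rough sketch of that per-query routing idea, in Python. Everything here is an assumption made for illustration: the expert counts, the `route_query` and `generate` names, and the random scoring that stands in for the small router model. It is not Apple's implementation, only the shape of "pick experts once, reuse them for every token".

```python
import random

NUM_EXPERTS = 8      # hypothetical: total experts kept in flash (NAND)
EXPERTS_IN_RAM = 2   # hypothetical: how many experts fit in RAM at once


def route_query(prompt: str, k: int = EXPERTS_IN_RAM) -> list[int]:
    """Stand-in for the small router model: score experts from the prompt, keep the top k."""
    rng = random.Random(prompt)  # deterministic per prompt, purely for illustration
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    ranked = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


def generate(prompt: str, num_tokens: int) -> tuple[int, list[tuple[int, ...]]]:
    """Return (number of flash-to-RAM loads, experts used at each token step)."""
    loaded = tuple(route_query(prompt))  # routing decision made once per query
    flash_loads = 1                      # a single load covers the whole generation
    used_per_token = []
    for _ in range(num_tokens):
        # A typical MoE would re-route here for every token; this sketch never does.
        used_per_token.append(loaded)
    return flash_loads, used_per_token


flash_loads, used = generate("summarize my unread email", num_tokens=5)
print(flash_loads, set(used))
```

The point of the design is that the slow step (reading expert weights from flash) is paid once per query rather than once per token.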
Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use.
Its capabilities exceed those of any model we’ve ever made generally available.
It's good to read and understand opposing viewpoints on important subjects. I understand why everyone is hyped about AI coding agents; I use them and they are pretty cool! But let's read this differing opinion, just to challenge our confirmation bias.
This is from @realGeorgeHotz. Read it for fun, if nothing else.
geohot.github.io/blog/jekyll…
Yes, the term "token anxiety" will soon enter psychology's Diagnostic and Statistical Manual, and there will be a few dozen papers on it.
Therapists are going to have a bright future.
The classic example: ask an LLM how many R’s are in “strawberry.” LLMs used to get it wrong. That’s not the model failing at counting. It’s the model not operating on letters directly, only on token IDs that happen to spell out a word a human would split letter by letter.
0xkato.xyz/how-llms-actually…
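
To make the strawberry point concrete, here is a toy tokenizer in Python. The vocabulary, the token IDs, and the greedy longest-match rule are all made up for illustration; real tokenizers split words differently. The sketch only shows that the model receives a few integers, and that the letters are recoverable only by mapping those integers back to text.

```python
# Hypothetical vocabulary and IDs, not taken from any real tokenizer.
VOCAB = {"str": 1, "aw": 2, "berry": 3}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        matches = [piece for piece in VOCAB if text.startswith(piece, i)]
        if not matches:
            raise ValueError(f"no token covers position {i} of {text!r}")
        piece = max(matches, key=len)  # prefer the longest matching piece
        ids.append(VOCAB[piece])
        i += len(piece)
    return ids


def count_letter(ids: list[int], letter: str) -> int:
    """Counting letters requires decoding IDs back to text first."""
    return sum(ID_TO_PIECE[token_id].count(letter) for token_id in ids)


ids = tokenize("strawberry")
print(ids)                     # what the model operates on: three integers
print(count_letter(ids, "r"))  # the letter count, available only after decoding
```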