There is still a lot to improve in this recipe, I think: muP integration, the optimizer, as well as optimizations on the inference side to really make the small KV cache shine.
One fun bit was a snarky answer I got from Copilot, surely a punishment for being too lazy an intern: