Adds a 3 layer GELU MLP to predict the residual latents between each token in each SGD batch. Loss function is L1 between latents and the predicted token distribution (KL)
Next-token prediction is myopic. What if transformers learn to predict their own next latent state?
๐ We present ๐ก๐ฒ๐
๐-๐๐ฎ๐๐ฒ๐ป๐ ๐ฃ๐ฟ๐ฒ๐ฑ๐ถ๐ฐ๐๐ถ๐ผ๐ป (๐ก๐ฒ๐
๐๐๐ฎ๐): a self-supervised learning method that teaches transformers to form compact world models for reasoning and planning. It also unlocks up to 3.3x faster inference via self-speculative decoding! ๐