LLM2Vec-Gen represents a major paradigm shift for embeddings/retrieval. Why encode the query when the LLM already knows what to look for and can directly produce an embedding for it?
Best part: it’s self-supervised, and it does all of this while the LLM remains completely frozen.
Think about it: "solve x² + 3x − 4 = 0" has zero reasoning in it. But the LLM's response does. By encoding the response, the embedding captures the reasoning, and the better the LLM reasons, the better the embedding. This is why our results scale with model size. As LLMs get smarter, our embeddings automatically get better.
LLM2Vec-Gen is also the first demonstration of the promise of @ylecun's JEPA for text embeddings. The alignment loss is JEPA: predict in representation space, not token space. The reconstruction loss goes beyond: it keeps embeddings decodable.
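To make "predict in representation space, not token space" concrete, here's a toy sketch. Everything in it is an assumption for illustration: the thread doesn't say which distance the alignment loss uses, so mean squared error is a stand-in, `alignment_loss` is a hypothetical name, and the vectors are made up rather than real model outputs.

```python
def alignment_loss(predicted, target):
    """Mean squared error between two embedding vectors.

    Assumption: MSE is only a stand-in here; the actual LLM2Vec-Gen
    alignment loss may use a different distance.
    """
    if len(predicted) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical toy vectors standing in for real model outputs.
query_side = [0.2, 0.9, -0.1]     # embedding produced from the query alone
response_side = [0.25, 0.8, 0.0]  # embedding of the LLM's own response

# The loss compares vectors directly; no response tokens are compared or generated.
loss = alignment_loss(query_side, response_side)
print(loss)
```

The point of the sketch is only where the comparison happens: two vectors go in and a scalar comes out, with no token-level prediction anywhere in the loss.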
This paradigm shift opens new frontiers:
🔬 Can we build a full JEPA for language where the teacher and student are the same LLM?
⚡ Can LLMs reason in compressed space without ever generating text?
🤖 Can agents reason in compression tokens and carry that directly into retrieval?
💬 Can agents talk to each other in compression tokens instead of text: dense, fast, and still human-readable?
LLM2Vec-Gen is a first step toward all four.
Your LLM already knows the answer. Why is your embedding model still encoding the question?
🚨 Introducing LLM2Vec-Gen: your frozen LLM generates the answer's embedding in a single forward pass, without ever generating the answer. Not only that, the frozen LLM can decode the embedding back into text.
🏆 SOTA self-supervised embeddings
🛡️ Free transfer of instruction-following, safety, and reasoning