jina-code-embeddings (0.5B and 1.5B) are compact autoregressive code embedding models for code retrieval, technical QA, and cross-lingual similarity. Built on Qwen2.5-Coder backbones, they use last-token pooling with task-specific instruction prefixes (NL2Code, TechQA, Code2Code, Code2NL, Code2Completion), and they surpass larger general-purpose embedding models on code retrieval benchmarks.
Training
- Objective: contrastive InfoNCE loss (temperature τ=0.05), which pulls related query–code pairs together in embedding space and pushes unrelated ones apart, combined with Matryoshka representation learning, which keeps embeddings usable after truncation so users can trade precision for efficiency
- Data: a mix of real and synthetic sources. Real datasets included MTEB code tasks, CoSQA, CodeSearchNet, CommitPackFT, LeetCode, WikiSQL, and SWE-Bench. Synthetic datasets were generated with GPT-4o and cover deep learning framework translations and multilingual extensions of the CodeChef dataset
- Hardware: 4×A100-80GB, 1500 steps (0.5B ≈8.3h, 1.5B ≈12h)
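The training objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the team's training code: it assumes in-batch negatives (row i of the queries is paired with row i of the code embeddings), and the prefix sizes in `matryoshka_loss` as well as the simple averaging across prefixes are hypothetical choices, since the source does not say how the truncated losses were combined. All function names are made up for this example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def info_nce(queries, codes, tau=0.05):
    """In-batch InfoNCE: row i of `queries` is the positive for row i of `codes`;
    every other row in the batch acts as a negative."""
    q = l2_normalize(queries)
    c = l2_normalize(codes)
    logits = (q @ c.T) / tau                              # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize the softmax numerically
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy on the diagonal

def truncate(emb, dim):
    """Matryoshka-style truncation: keep the first `dim` components, then renormalize."""
    return l2_normalize(emb[..., :dim])

def matryoshka_loss(queries, codes, dims=(16, 32, 64), tau=0.05):
    """Average InfoNCE over several embedding prefixes (dims are hypothetical)."""
    return float(np.mean([info_nce(queries[:, :d], codes[:, :d], tau) for d in dims]))

rng = np.random.default_rng(0)
codes = rng.normal(size=(8, 64))
aligned_queries = codes + 0.1 * rng.normal(size=(8, 64))  # queries close to their paired code
random_queries = rng.normal(size=(8, 64))                 # queries unrelated to any code

loss_aligned = info_nce(aligned_queries, codes)
loss_random = info_nce(random_queries, codes)
short = truncate(codes, 16)
```

With a temperature as low as 0.05, small differences in cosine similarity become large differences in logits, so the loss drops sharply once each query is closest to its own code snippet. Truncated embeddings are renormalized because cosine similarity on a prefix otherwise depends on how much of the vector's norm the prefix happens to carry.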
The team compared pooling strategies for producing embeddings. Last-token pooling consistently gave the best results, outperforming mean pooling and latent attention pooling by about one point on the MTEB code average. A plausible explanation: in a decoder-only model like Qwen2.5-Coder, causal attention means only the final token's hidden state has attended to the entire input, so it summarizes the full sequence.
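To make the difference between the two simpler strategies concrete, here is a small NumPy sketch of last-token and mean pooling over a batch of hidden states. It assumes right-padded sequences and an attention mask with 1 for real tokens; with left padding the last real token would simply sit at index -1. The function names and the toy tensor are illustrative, not taken from the models' code.

```python
import numpy as np

def last_token_pool(hidden, mask):
    """Pick the hidden state of the last real token in each sequence.
    hidden: (batch, seq, dim); mask: (batch, seq), 1 for tokens, 0 for right padding."""
    last_idx = mask.sum(axis=1) - 1                       # position of the last real token
    return hidden[np.arange(hidden.shape[0]), last_idx]

def mean_pool(hidden, mask):
    """Average the hidden states of real tokens, ignoring padding."""
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1.0)

# Toy batch: 2 sequences, 4 positions, 3 hidden dimensions.
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])                           # second sequence has 2 real tokens

last = last_token_pool(hidden, mask)
mean = mean_pool(hidden, mask)
```

In this toy batch the second sequence has only two real tokens, so last-token pooling returns the state at position 1 rather than the padded position 3. Getting that index wrong is a common source of silently degraded embeddings.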
Results (MTEB Code Retrieval average)
- jina-code-embeddings-0.5B (JCE-0.5B): 78.41
- jina-code-embeddings-1.5B (JCE-1.5B): 78.72
For comparison, jina-embeddings-v4 scores 74.11, Gemini-001 74.87, and Qwen3-Embedding-0.6B 73.49.