Many-shot in-context learning works best when the prompt mixes a small, tailored slice with a big, reusable backbone.
This paper shows two simple ways to do that, cutting cost while keeping or improving accuracy.
The trick is a hybrid selection that stays mostly cached.
Strategy 1 picks 20 demonstrations most similar to the test case, then fills the rest with a fixed random set that can be cached, so a 100-shot prompt becomes 20 tailored plus 80 cached.
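A minimal sketch of Strategy 1, under some assumptions the summary does not spell out: demonstrations and the test case are already embedded as vectors, similarity is cosine similarity, the cached block goes first so a prefix cache can reuse it, and cached demonstrations are excluded from the tailored slice to avoid duplicates. The names `build_hybrid_prompt`, `pool_vecs`, and `cached_ids` are illustrative, not from the paper.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_hybrid_prompt(test_vec, pool_vecs, cached_ids, n_tailored=20):
    """Return demonstration indices: the fixed cached block, then the tailored slice."""
    cached = set(cached_ids)
    # Assumption: skip demonstrations that are already in the cached block.
    candidates = [i for i in range(len(pool_vecs)) if i not in cached]
    ranked = sorted(candidates,
                    key=lambda i: cosine(test_vec, pool_vecs[i]),
                    reverse=True)
    # Assumption: cached block first, so the shared prefix stays identical per query.
    return list(cached_ids) + ranked[:n_tailored]

# Toy data: 500 random 8-dimensional "embeddings" stand in for a real demo pool.
rng = random.Random(0)
pool_vecs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
cached_ids = rng.sample(range(500), 80)      # fixed random set, chosen once
test_vec = [rng.gauss(0, 1) for _ in range(8)]

prompt_ids = build_hybrid_prompt(test_vec, pool_vecs, cached_ids, n_tailored=20)
```

Here `prompt_ids` holds 100 indices: the same 80 cached ones for every query, followed by 20 that change with `test_vec`.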
Strategy 2 swaps the random cache for a smarter fixed cache: it clusters test samples with k-means to find centroids, then caches the demonstrations closest to those centroids, which makes the cached part diverse and still reusable.
Pure similarity selection rewrites the whole prompt for every test sample, so caching breaks and cost grows roughly with the square of the token count.
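The centroid-based cache can be sketched roughly as follows. This is a toy version under stated assumptions: a plain Lloyd's k-means written with the standard library, squared Euclidean distance, and one cached demonstration per centroid. The paper may differ on all three, and `select_cached_demos` is a hypothetical helper name.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def select_cached_demos(test_vecs, pool_vecs, n_cached):
    """Pick one demonstration per centroid of the test samples (assumption)."""
    centroids = kmeans(test_vecs, n_cached)
    chosen, taken = [], set()
    for centroid in centroids:
        best = min((i for i in range(len(pool_vecs)) if i not in taken),
                   key=lambda i: sq_dist(pool_vecs[i], centroid))
        chosen.append(best)
        taken.add(best)  # never cache the same demonstration twice
    return chosen

# Toy data: random vectors stand in for embedded test samples and demonstrations.
rng = random.Random(1)
test_vecs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(60)]
pool_vecs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
cached_ids = select_cached_demos(test_vecs, pool_vecs, n_cached=8)
```

Because the cache is computed once from the test distribution and then frozen, it stays reusable across queries in the same way the random cache does.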
These hybrids keep most tokens cached, so cost grows roughly with the cached content, not the full prompt.
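To make the caching argument concrete, here is a back-of-the-envelope count of how many demonstrations have to be encoded from scratch over a batch of queries. It is a toy count, not the paper's cost model: it assumes the cached block is encoded exactly once and ignores the attention that fresh tokens still pay over cached ones. The function name and the figure of 1000 queries are made up for illustration.

```python
def fresh_demo_count(n_shots, n_tailored, n_queries):
    """Demonstrations encoded from scratch across n_queries prompts."""
    n_cached = n_shots - n_tailored
    # The cached block is encoded once; the tailored slice is redone per query.
    return n_cached + n_tailored * n_queries

# Pure similarity selection: every slot is tailored, nothing is reused.
pure_similarity = fresh_demo_count(n_shots=100, n_tailored=100, n_queries=1000)

# Hybrid from the text: 20 tailored plus 80 cached.
hybrid = fresh_demo_count(n_shots=100, n_tailored=20, n_queries=1000)
```

Under these simplifications the hybrid encodes 20,080 demonstrations against 100,000 for pure similarity selection.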
In tests on ANLI, TREC, GSM Plus, and MetaTool using Gemini Pro and Gemini Flash, the hybrids matched the best baseline or did better, while being about 2x cheaper at 50 shots and about 10x cheaper at 200 shots.
The ratio of tailored to cached demonstrations is a tunable knob.
In low-data setups, pushing more slots to the tailored slice gave around 3% to 6% accuracy gains over simply using the full small pool.
The core idea is practical: keep a large cached backbone for speed, then add a small personalized slice for relevance.
----
Paper – arxiv.org/abs/2507.16217
Paper Title: "Towards Compute-Optimal Many-Shot In-Context Learning"