THE YEAR OF AI CONTEXT IS HERE.
For the past few months I’ve been working on context for AI models. I genuinely believe this is one of the biggest opportunities in the industry right now.
Because context is messy.
Most models choke at ~100k tokens. And “just do RAG” often feels underwhelming, not because RAG is bad, but because RAG alone is not memory. It’s a search tool.
So… how should we approach context?
I think the best starting point is the human brain.
Short-term memory is where we store the “living” memory. It’s fresh, changing all the time, and it’s what you’re actively holding right now.
Mid-term memory is the working context of recent events, what you can reason over without “looking it up”.
Long-term memory is where experience lives: history, principles, beliefs, the stuff that grounds you. It’s not always “in your head”… you retrieve it when needed.
Now, what tools do we actually have in engineering to approximate that?
We have three primitives:
1) Compaction (summaries / rolling state)
This is why tools like @cursor_ai / @claudeai can keep going: they compress the past into something smaller.
Downside: extreme compaction drifts. It loses details. It loses grounding.
2) Compression (real compression, not summarization)
This is the exciting part. Recently, beacon-style compression showed you can compact tokens (think ~8x) and keep inference fast. This opens the door to mid-term context at a scale that feels like “working memory”, not “tiny window”.
3) RAG (retrieval + citations)
RAG is great for long-term. It’s not perfect, it’s not always precise, but it’s the best tool we have to search large archives and bring back evidence.
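To make the first primitive concrete, here is a minimal sketch of rolling compaction: keep the last few turns raw and fold everything older into a running summary. This is not how Cursor or Claude do it internally; `RollingContext` and `toy_summarize` are hypothetical names, and the summarizer is a truncating stand-in for what would be an LLM call in practice.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def toy_summarize(previous_summary: str, evicted: List[str]) -> str:
    """Stand-in for an LLM summarizer (assumption): clips each evicted turn
    to 40 characters and appends it to the running summary."""
    clipped = [turn[:40] for turn in evicted]
    parts = ([previous_summary] if previous_summary else []) + clipped
    return " | ".join(parts)


@dataclass
class RollingContext:
    """Keeps the newest turns verbatim and compacts older ones into a summary."""
    max_raw_turns: int = 4
    summarize: Callable[[str, List[str]], str] = toy_summarize
    summary: str = ""
    raw_turns: List[str] = field(default_factory=list)

    def add(self, turn: str) -> None:
        self.raw_turns.append(turn)
        overflow = len(self.raw_turns) - self.max_raw_turns
        if overflow > 0:
            # Oldest turns leave the raw window and are folded into the summary.
            evicted = self.raw_turns[:overflow]
            self.raw_turns = self.raw_turns[overflow:]
            self.summary = self.summarize(self.summary, evicted)

    def prompt(self) -> str:
        """Builds the text that would be sent to the model."""
        header = f"[summary] {self.summary}\n" if self.summary else ""
        return header + "\n".join(self.raw_turns)


ctx = RollingContext(max_raw_turns=2)
for turn in ["a", "b", "c", "d"]:
    ctx.add(turn)
print(ctx.prompt())
```

The drift problem shows up right in the stand-in: every time a turn is evicted, only a clipped version survives, and a real summarizer loses detail in the same way, just less visibly.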
So instead of picking one, I built an integrated, automated memory system that uses all three, the way your brain does.
Here’s the part that feels like a novelty/discovery to me:
Short-term is always there (raw, timestamped, provider-aware “brain log”). When it grows, we build a mid-term working memory via compression/compilation so the model can reason over a much larger “recent context” without stuffing it into the prompt.
And as data keeps accumulating, we maintain a long-term archive that’s searchable with embeddings + vector search, optionally reranked for precision.
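For a feel of what the long-term lookup does, here is a toy sketch of embedding search with an optional rerank step. Everything in it is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, `search` scans a plain list instead of a vector index, and the `rerank` hook is a hypothetical placeholder for a reranker model.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]
Hit = Tuple[float, str]


def embed(text: str) -> Vector:
    """Toy embedding (assumption): lowercase word counts."""
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(
    query: str,
    archive: List[str],
    k: int = 2,
    rerank: Optional[Callable[[str, List[Hit]], List[Hit]]] = None,
) -> List[Hit]:
    """Returns the top-k (score, snippet) pairs, optionally reranked."""
    q = embed(query)
    hits = [(cosine(q, embed(doc)), doc) for doc in archive]
    hits.sort(key=lambda hit: hit[0], reverse=True)
    top = hits[:k]
    return rerank(query, top) if rerank else top


archive = [
    "the invoice was paid in march",
    "user prefers dark mode",
    "invoice dispute opened in april",
]
for score, snippet in search("invoice march", archive):
    print(f"{score:.2f}  {snippet}")
```

The snippets that come back are evidence, not memory: the model still has to read them in the prompt, which is why retrieval alone covers the long-term tier but not the other two.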
All behind one endpoint in @DataGran:
mind_state=short_term → raw “living” memory
mind_state=mid_term → living memory + a synthesized answer from working memory
mind_state=long_term → living memory + working memory + retrieved historical snippets
mind_state=auto → the system picks what’s available
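One way to read those four values is as stacked layers, where each state includes the ones before it. The sketch below is only my illustration of that reading, not the actual server logic: `resolve_mind_state` is a hypothetical helper, and the rule that `auto` returns the deepest stack whose layers are all built is an assumption.

```python
from typing import List, Set

# Order matters: each state stacks on top of the previous layers.
LAYER_ORDER = ["short_term", "mid_term", "long_term"]


def resolve_mind_state(requested: str, available: Set[str]) -> List[str]:
    """Returns the memory layers a request would read from.

    `available` is the set of layers already built for this user (assumption).
    """
    if requested == "auto":
        # Assumed rule: take layers in order until one is not built yet.
        layers: List[str] = []
        for layer in LAYER_ORDER:
            if layer not in available:
                break
            layers.append(layer)
        return layers

    if requested not in LAYER_ORDER:
        raise ValueError(f"unknown mind_state: {requested}")

    layers = LAYER_ORDER[: LAYER_ORDER.index(requested) + 1]
    missing = [layer for layer in layers if layer not in available]
    if missing:
        raise ValueError(f"layers not built yet: {missing}")
    return layers


print(resolve_mind_state("auto", {"short_term", "mid_term"}))
print(resolve_mind_state("long_term", {"short_term", "mid_term", "long_term"}))
```

With this reading, a brand-new account on `auto` only gets the short-term layer, and the deeper layers appear on their own as more data accumulates.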
This is still early, but it’s the first time I feel like “context” is becoming an actual product primitive instead of a pile of hacks.
If you want to try it, here’s a tiny cURL walkthrough using the @firecrawl integration. You’ll get a feel for how we automatically load and manage short-, mid-, and long-term memory. Ideally, Datagran will feel like plugging intelligence into the matrix while we manage the context.
proud-botany-7dd.notion.site…
Final note: all of this is built on our own GPUs, which are not live all the time. With that in mind, your initial query may feel very slow. Subsequent queries should take about 1 second.
Also, it is free, which means it may collapse if many users try it at the same time.
That said, this is a very early beta for those who want to try it out.
I would really love it if people like @karpathy, @svpino, @Suhail, and @nico_fiorito, among many others, could give it a spin.
Disclaimer: Our beacon solution is based on the paper "Long Context Compression with Activation Beacon" by Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou from the Beijing Academy of Artificial Intelligence.