Holy shit. The biggest unsolved problem in AI agents isn't reasoning, it's memory. Your agent forgets everything between sessions.
MemFactory just open-sourced the first unified framework for training agents to manage their own memory via reinforcement learning.
Extract. Update. Retrieve. All trainable. All modular. Runs on one GPU.
Every AI agent built today is amnesiac by design. It can reason. It can plan. It can use tools.
But the moment a session ends, everything it learned about you, your preferences, your context, and your history disappears.
The next conversation starts from zero.
This is not a minor inconvenience. It is the fundamental barrier between AI assistants and AI agents that actually work over days, weeks, and months. The field has known this for years.
The solutions have been fragmented, task-specific, and impossible to combine. Memory-R1 handles structured CRUD operations on a memory bank. MemAgent compresses history into a fixed-length recurrent state.
RMM optimizes retrieval through retrospective reflection. Each works.
None can be combined. Each lives in its own repository with its own data format, its own training pipeline, and its own set of assumptions. MemFactory ends that fragmentation.
The core insight is that memory management is a decision problem, not a retrieval problem. Current systems treat memory as a database: store things, look things up.
MemFactory treats memory as a policy: an agent that learns when to extract new information, when to update existing memories, when to delete contradicted facts, and what to retrieve for any given query.
That policy is trained via reinforcement learning, specifically Group Relative Policy Optimization (GRPO), which eliminates the need for a separate critic model and cuts training memory requirements roughly in half.
This matters because memory-augmented agents already have saturated context windows from dialogue history and retrieved content.
The last thing they need is a training algorithm that doubles the memory footprint.
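Why no critic? GRPO samples a group of rollouts for the same input and scores each one against the group's own average, so the group itself acts as the baseline. A minimal sketch of that normalization, assuming population standard deviation and a small epsilon for stability. The function name and the reward values are made up for illustration; this is not MemFactory's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: population std plus a small eps; real implementations
    may use sample std or a different stabilizer.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts sampled from the same prompt.
group_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(group_rewards)

for reward, adv in zip(group_rewards, advantages):
    print(f"reward={reward:.2f}  advantage={adv:+.3f}")
```

Rollouts above the group mean get a positive advantage, rollouts below it get a negative one, and no value network is needed to estimate the baseline.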
The architecture is four layers that compose like Lego blocks.
The Module Layer decomposes memory into atomic operations:
> Extractor parses raw conversations into structured memory entries,
> Updater decides, for each new piece of information, whether to add it, modify an existing entry, delete a contradicted one, or leave the memory bank alone,
> Retriever fetches relevant memories using semantic search or LLM-based reranking.
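To make the three modules concrete, here is a toy sketch of what such interfaces could look like. Every name in it (`UpdateOp`, `MemoryEntry`, `ColonExtractor`, `RuleUpdater`, `KeywordRetriever`, `run_pipeline`) is hypothetical and not taken from the MemFactory codebase, and simple hand-written rules stand in for the LLM policy that would make these decisions in a trained system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Protocol


class UpdateOp(Enum):
    ADD = "add"
    UPDATE = "update"
    DELETE = "delete"
    NOOP = "noop"


@dataclass
class MemoryEntry:
    key: str
    value: str


# Hypothetical interfaces: any object with these methods can be plugged in.
class Extractor(Protocol):
    def extract(self, conversation: str) -> list[MemoryEntry]: ...


class Updater(Protocol):
    def decide(self, bank: dict[str, str], entry: MemoryEntry) -> UpdateOp: ...


class Retriever(Protocol):
    def retrieve(self, bank: dict[str, str], query: str, k: int) -> list[MemoryEntry]: ...


class ColonExtractor:
    """Toy extractor: treats every 'key: value' line as one memory entry."""

    def extract(self, conversation: str) -> list[MemoryEntry]:
        entries = []
        for line in conversation.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                entries.append(MemoryEntry(key.strip().lower(), value.strip()))
        return entries


class RuleUpdater:
    """Toy updater: fixed rules stand in for the learned decision."""

    def decide(self, bank: dict[str, str], entry: MemoryEntry) -> UpdateOp:
        known = entry.key in bank
        if not entry.value:  # an empty value retracts the fact
            return UpdateOp.DELETE if known else UpdateOp.NOOP
        if not known:
            return UpdateOp.ADD
        return UpdateOp.NOOP if bank[entry.key] == entry.value else UpdateOp.UPDATE


class KeywordRetriever:
    """Toy retriever: ranks entries by word overlap with the query."""

    def retrieve(self, bank: dict[str, str], query: str, k: int = 1) -> list[MemoryEntry]:
        words = set(query.lower().split())

        def overlap(item):
            key, value = item
            return len(words & set(f"{key} {value}".lower().split()))

        ranked = sorted(bank.items(), key=overlap, reverse=True)
        return [MemoryEntry(key, value) for key, value in ranked[:k]]


def apply_op(bank: dict[str, str], entry: MemoryEntry, op: UpdateOp) -> None:
    """Apply the updater's decision to the memory bank in place."""
    if op in (UpdateOp.ADD, UpdateOp.UPDATE):
        bank[entry.key] = entry.value
    elif op is UpdateOp.DELETE:
        bank.pop(entry.key, None)


def run_pipeline(turns: list[str], extractor: Extractor, updater: Updater) -> dict[str, str]:
    bank: dict[str, str] = {}
    for turn in turns:
        for entry in extractor.extract(turn):
            apply_op(bank, entry, updater.decide(bank, entry))
    return bank


bank = run_pipeline(["city: Berlin", "city: Munich", "diet: vegan"],
                    ColonExtractor(), RuleUpdater())
top = KeywordRetriever().retrieve(bank, "which city do I live in", k=1)
print(bank, top)
```

Because `run_pipeline` only depends on the method signatures, replacing `RuleUpdater` or `KeywordRetriever` with a different implementation requires no other changes, which is the kind of modularity the post describes.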
The Agent Layer assembles these modules into a complete memory policy and executes rollout trajectories during training.
The Environment Layer standardizes any dataset into the format the agent needs and computes reward signals: format rewards for structural compliance, LLM-as-a-judge scores for quality.
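A rough sketch of how a format reward and a judge score might be blended into one scalar. The JSON schema, the 0.2/0.8 weighting, and the function names are assumptions for illustration, not MemFactory's actual reward definition; the judge score is passed in as a plain number because calling a judge model is out of scope here.

```python
import json

# Hypothetical schema: fields a well-formed memory operation must contain.
REQUIRED_FIELDS = {"op", "key", "value"}


def format_reward(output: str) -> float:
    """Return 1.0 if the output is a JSON object with all required fields, else 0.0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if REQUIRED_FIELDS.issubset(parsed) else 0.0


def total_reward(output: str, judge_score: float, format_weight: float = 0.2) -> float:
    """Blend structural compliance with a quality score in [0, 1].

    Assumption: a simple weighted sum; the real weighting is not given in the post.
    """
    if not 0.0 <= judge_score <= 1.0:
        raise ValueError("judge_score must be in [0, 1]")
    return format_weight * format_reward(output) + (1.0 - format_weight) * judge_score


well_formed = '{"op": "add", "key": "city", "value": "Berlin"}'
malformed = "add city Berlin"

print(total_reward(well_formed, judge_score=0.5))
print(total_reward(malformed, judge_score=0.5))
```

With the same judge score, the well-formed output earns the extra format credit and the malformed one does not, so the policy is nudged toward structurally valid memory operations.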
The Trainer Layer runs GRPO to update the memory policy based on those rewards. Every module plugs into every other module through standardized interfaces.
You can swap the retriever in Memory-R1 for an LLM-based reranker without touching anything else.
The results from training a MemAgent-style architecture through MemFactory on two base models:
→ Qwen3-1.7B base: average score 0.3118 across three evaluation sets
→ Qwen3-1.7B after MemFactory RL: 0.3581 (14.8% relative improvement)
→ Qwen3-4B-Instruct base: average score 0.6146
→ Qwen3-4B-Instruct after MemFactory RL: 0.6595 (7.3% relative improvement)
→ 4B model gains hold on out-of-distribution benchmarks: the memory policy transfers to unseen tasks
→ Entire training and evaluation pipeline runs on a single NVIDIA A800 80GB GPU
→ 250 training steps on simplified long-context data, no massive compute cluster required
→ Three ready-to-use agent architectures out of the box: MemoryR1Agent, MemoryAgent, MemoryRMMAgent
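The relative improvements quoted above follow directly from the average scores. A quick check (the helper name is just for illustration):

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional gain of `after` over `before`."""
    return (after - before) / before


small = relative_improvement(0.3118, 0.3581)  # Qwen3-1.7B
large = relative_improvement(0.6146, 0.6595)  # Qwen3-4B-Instruct

print(f"Qwen3-1.7B: {small:.1%}")
print(f"Qwen3-4B-Instruct: {large:.1%}")
```

The absolute gains are nearly the same (about 0.046 and 0.045); the smaller model's relative gain is larger only because it starts from a lower baseline.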
> The out-of-distribution result is the one that matters most. The 1.7B model improved on in-domain tasks but slightly degraded on the OOD benchmark, which suggests the learned policy was too specific to the training distribution.
The 4B model improved on both.
This points to a capability threshold at which a memory policy becomes genuinely general: large enough to abstract principles about what information is worth keeping, not just pattern-match on training examples.
A memory agent that only remembers the right things in familiar situations is not much better than no memory at all.
The 4B result suggests that threshold is reachable with models small enough to fit on a single consumer GPU, even though the training run itself used a data-center A800.
> The fragmentation problem MemFactory solves is deeper than it looks. When every memory implementation has its own pipeline, researchers cannot compare approaches fairly.
Two systems that nominally differ by one design choice (say, CRUD operations versus recurrent state compression) actually differ simultaneously in data format, reward structure, training algorithm, and evaluation protocol.
Nobody knows which choice caused which outcome.
MemFactory puts all three major paradigms under the same training loop, the same reward computation, and the same evaluation framework.
Now you can actually isolate what matters.
Your agent forgets everything. This is the infrastructure to fix that.