Last week’s Post-Transformer debate post raised one question: can long-term memory become part of the architecture?
It points to one promising mathematical idea behind Post-Transformer AI: linear attention in high dimension with persistent state.
In a standard Transformer, memory is handled through caching context.
The model keeps previous keys and values in a small dimension d, then attends over them. But this is still token history.
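To make the caching picture concrete, here is a minimal NumPy sketch of single-head attention over a growing key/value cache. The class name KVCache, the sizes, and the random inputs are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

class KVCache:
    """Hypothetical single-head cache: one key row and one value row per token."""

    def __init__(self, d):
        self.d = d
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def step(self, q, k, v):
        # Append the new token's key and value: the cache grows by one row.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])
        # Attend over every token stored so far.
        scores = self.keys @ q / np.sqrt(self.d)
        weights = softmax(scores)
        return weights @ self.values

rng = np.random.default_rng(0)
d = 8  # illustrative size only
cache = KVCache(d)
sizes = []
for t in range(5):
    q, k, v = rng.normal(size=(3, d))
    out = cache.step(q, k, v)
    sizes.append(cache.keys.shape[0])
```

The cache gains one row per token, so both memory and per-step compute grow with the length of the sequence.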
BDH (Dragon Hatchling), one of the Post-Transformer architectures, takes a different route.
The paper describes BDH's state space as fixed and large. At the macro level it can be read as an associative memory, like a KV cache, but organized differently.
Each layer has a persistent state matrix: ρₗ ∈ Rⁿˣᵈ
Here:
n = neuronal or concept dimension
d = low rank synaptic dimension
d << n
The key idea is that the state is aligned to neurons, in a high-dimensional space (n on the order of billions).
A Transformer stores token history, whereas BDH-GPU (a tensor-friendly version of the BDH architecture) evolves state, similar to State-Space Models.
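A minimal NumPy sketch of evolving a fixed-size state, assuming a generic linear-attention style outer-product update rather than the exact BDH-GPU equations from the paper. The class LinearState, the ReLU keys, and the toy sizes are hypothetical choices for illustration.

```python
import numpy as np

def relu(x):
    # Non-negative activations, standing in for the positive activations of the post.
    return np.maximum(x, 0.0)

class LinearState:
    """Hypothetical fixed-size associative memory with state rho of shape (n, d)."""

    def __init__(self, n, d):
        self.rho = np.zeros((n, d))

    def write(self, key, value):
        # Rank-1 update: the state changes, but its shape never does.
        self.rho += np.outer(key, value)

    def read(self, query):
        # Linear readout: no softmax and no pass over stored tokens.
        return query @ self.rho

n, d = 1024, 16  # illustrative sizes only, chosen so that d << n
rng = np.random.default_rng(0)
state = LinearState(n, d)
shapes = []
for t in range(100):
    state.write(relu(rng.normal(size=n)), rng.normal(size=d))
    shapes.append(state.rho.shape)
out = state.read(relu(rng.normal(size=n)))

# Associative recall in the simplest case: a one-hot key returns its stored value.
mem = LinearState(4, 2)
one_hot = np.array([0.0, 0.0, 1.0, 0.0])
mem.write(one_hot, np.array([1.0, 2.0]))
recalled = mem.read(one_hot)
```

After 100 tokens the state is still n x d. Its footprint depends on the architecture, not on how many tokens have been seen.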
This is where the brain analogy becomes useful. The brain does not append every experience into a longer transcript. It has a large bounded substrate of neurons and synapses, where experience changes connections sparsely and with high parallelism.
BDH-GPU expresses a related idea computationally:
not memory as a longer context window,
but memory as a large, evolving internal state.
Why it matters:
– no Transformer-style hard context window, which in practice allows an effectively unbounded context in a reasoning model
– linear attention in a large neuronal dimension
– sparse positive activations
– persistent state instead of only token history
The deeper insight:
Long-horizon reasoning may not come from storing more tokens.
It may very well come from better state dynamics.