We turn LLMs into chatbots by wrapping each turn in role markers, e.g. User and Assistant:
User: What's the capital of France?
Assistant: Paris.
User: What did I just say?
The "I just said" attribution works because the tokens are cleanly labeled with role markers. But strip the markers and flatten the context, and the model has no principled way to tell who produced what. Worse, once the conversation is summarized and compressed for long-term memory, those role markers often disappear, and the model is left with a blur of "things that were said" without clear provenance.
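To make the flattening problem concrete, here is a minimal Python sketch. The `Turn` class and the helper names are hypothetical, made up for illustration; real chat templates and memory summarizers are more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    role: str  # "user" or "assistant" (hypothetical role labels)
    text: str


def render_with_markers(turns):
    """Render the chat with explicit role markers, one turn per line."""
    return "\n".join(f"{t.role.capitalize()}: {t.text}" for t in turns)


def flatten(turns):
    """Simulate a naive compression step that keeps the text but drops the roles."""
    return " ".join(t.text for t in turns)


def speaker_of(turns, text):
    """Provenance lookup; only possible while the role labels still exist."""
    for t in turns:
        if t.text == text:
            return t.role
    return None


turns = [
    Turn("user", "What's the capital of France?"),
    Turn("assistant", "Paris."),
    Turn("user", "What did I just say?"),
]

print(render_with_markers(turns))
print(flatten(turns))
```

With the structured turns, `speaker_of(turns, "Paris.")` can still answer "assistant". The flattened string contains the same words, but nothing in it says who produced them.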
This is exactly the pathology the Ortega paper (adaptiveagents.org/_media/un…) was designed to prevent. Without a distinction between actions (aka interventions) and observations, the model treats its own past outputs as indistinguishable from the world's outputs. In other words, it has no agency or, equivalently, it is not learning what it can cause.
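A toy illustration of why the action/observation distinction matters (my own example, not taken from the paper; the numbers are assumptions). Suppose there is a hidden binary fact `theta`, and the text the model was trained on came from a source whose action matched `theta` 90% of the time. If the model emits an action itself and then conditions on it as if it were an observation, it "learns" about `theta` from its own output. Treated as an intervention, the same action carries no evidence.

```python
def posterior_if_observed(prior, likelihood):
    """P(theta=1 | a=1) by Bayes' rule, treating the action as evidence.

    likelihood[t] is the assumed P(a=1 | theta=t).
    """
    p1 = prior * likelihood[1]
    p0 = (1.0 - prior) * likelihood[0]
    return p1 / (p1 + p0)


def posterior_if_intervened(prior):
    """P(theta=1 | do(a=1)): a self-chosen action says nothing about theta."""
    return prior


prior = 0.5
likelihood = {1: 0.9, 0: 0.1}  # assumed behaviour of the data source

print(posterior_if_observed(prior, likelihood))  # belief shifts toward theta=1
print(posterior_if_intervened(prior))            # belief stays at the prior
```

The gap between the two posteriors is the "blur" in miniature: the first update is a belief the model talked itself into.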
How do we fix this?
Option 1 is to train the model with provenance attribution as an explicit auxiliary task. Every time the model encounters information in its context, give it a supervision signal about the source. Over time, this should bias the internal representations toward encoding provenance even when surface markers are absent. This is a version of multi-task learning applied to self-world distinction.
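A rough sketch of what the option 1 objective could look like, in plain Python so it runs without a framework. Everything here is an assumption for illustration: the `SOURCES` label set, a per-token provenance head that outputs logits over sources, and the `aux_weight` mixing coefficient.

```python
import math

SOURCES = ("self", "user", "tool")  # hypothetical provenance labels


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def provenance_loss(source_logits, source_labels):
    """Mean cross-entropy between predicted and true source, per token."""
    total = 0.0
    for logits, label in zip(source_logits, source_labels):
        probs = softmax(logits)
        total += -math.log(probs[SOURCES.index(label)])
    return total / len(source_labels)


def combined_loss(lm_loss, source_logits, source_labels, aux_weight=0.1):
    """Usual language-modeling loss plus a weighted provenance term."""
    return lm_loss + aux_weight * provenance_loss(source_logits, source_labels)


# Two tokens: the head is confident about the first, undecided about the second.
logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = ["self", "user"]
print(combined_loss(2.0, logits, labels))
```

The important design question is when the labels are available: the supervision has to be present at training time even for contexts where the surface markers have been stripped, otherwise the head just learns to read the markers.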
A more ambitious option 2 (advocated by folks like @yudapearl) is to train the model to reason about its own causal role in producing information. Given a memory of a past interaction, can the model ask the counterfactual question "would this information exist if I hadn't acted?"
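As a starting point for thinking about option 2, here is the counterfactual question written out on a toy deterministic world. This is only a sketch: the `world` function is a hypothetical stand-in, and it assumes the exogenous facts are fully known so they can be held fixed while the agent's action is removed. A real model would have to infer both from memory.

```python
def world(action, exogenous_facts):
    """Hypothetical toy environment: the set of facts in memory after one step."""
    facts = set(exogenous_facts)
    if action is not None:
        facts.add(f"assistant said: {action}")
    return facts


def caused_by_me(fact, action, exogenous_facts):
    """Counterfactual test: would `fact` exist if I hadn't acted?

    Replays the same world with the action replaced by a no-op,
    holding everything exogenous fixed.
    """
    actual = world(action, exogenous_facts)
    counterfactual = world(None, exogenous_facts)
    return fact in actual and fact not in counterfactual


background = {"user asked about the capital of France"}
print(caused_by_me("assistant said: Paris.", "Paris.", background))
print(caused_by_me("user asked about the capital of France", "Paris.", background))
```

The first query comes back true (the fact disappears when the action is removed) and the second false (the fact is there either way), which is the self-world split the question is after.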
How could we go about implementing this more ambitious option 2?
Has anyone tried option 1?
What else have people tried to solve this problem?
In RL, as well as in @AdaptiveAgents' agency approach, it is assumed that the distinction between the agent and the world is given. However, we humans don't know which actions are our own when we are born. We learn this awareness of self and of other selves, and build on it to arrive at causal reasoning.
I feel that knowing which actions are one's own, and owning them, is important for AI safety.