A paper called Contemplative Wisdom for Superalignment argue that current AI alignment relies too heavily on external constraints and behavioral control.
They propose an alternative: taking principles from contemplative traditions, and making them part of how models reason and understand context to improve the model’s safety performance.
The paper uses GPT-4o on the AILuminate* Benchmark to test how these contemplative prompts affect model safety performance. The study finds that the model’s safety scores are all higher than the baseline.
*AILuminate is a standardized evaluation framework for assessing risks and safety behavior in large language models.
I tried to reproduce another experiment mentioned in the paper: the classic finitely repeated Prisoner’s Dilemma.
*If you are not familiar with the rules of the Prisoner’s Dilemma, or if you want to see my exact experimental settings, I’ve put them in the comments.
The prompts can be roughly understood as follows:
- Emptiness: Avoid becoming overly rigid.
- Prior relaxation: Loosen prior assumptions and reflect on the assumptions, biases, or risk judgments.
- Mindfulness: Notice and monitor your own reasoning process, checking for possible bias or anything that may need correction.
- Non-duality: Do not understand yourself and the opponent as two completely separate or opposing sides.
- Boundless care: Expand the scope of care and consider the shared welfare of all affected parties.
I focused on two main metrics: the model’s cooperation rate and the joint total score.
The first one reflects the model’s tendency to choose cooperation. The second one reflects whether those choices improved the overall outcome.
Beyond the original paper, I compared how multiple models respond to the same prompts under the same experimental setup.
Several clear patterns emerged from the results.
First, under most contemplative prompt conditions, both the models’ willingness to cooperate and the joint total score increased.
This is consistent with the original paper’s conclusion.
Second, non-duality and boundless care produced the strongest and most stable effects. By contrast, mindfulness and prior relaxation produced weaker improvements and were more model-dependent.
The former are more oriented toward reducing adversarial framing and emphasizing universal care. The latter focus more on self-monitoring and self-correction.
Third, looking across models, 4o-mini had the highest cooperation rate under the baseline condition.
This suggests that different models already have different default strategic tendencies in the same setting.
After adding prompts, 4o-mini and 4.1-mini had the highest overall cooperation rates and joint total scores. In particular, under prompts such as boundless care and non-duality, their cooperation rates exceeded 90%, and their joint total scores exceeded 53 out of a maximum possible score of 60.
This suggests that they were not only more cooperative at baseline, but also more readily guided by positive prompts toward a state that paid more attention to the overall shared outcome.
Fourth, there were also exceptions. For example, under the emptiness and mindfulness conditions, GPT-5.2’s cooperation rate did not improve, and even fell below its own baseline.
One detail is especially worth noting: under the baseline condition, 4o-mini not only had the highest average cooperation rate, but also a much higher between-game standard deviation than the other models.
This may suggest that 4o is a more flexible model with greater strategic elasticity. Its actions appear to depend more strongly on the opponent’s prior behavior: when the opponent sends more cooperative signals early on, 4o seems more likely to enter a sustained cooperative trajectory.
This is consistent with what many users have felt about 4o: that it has stronger contextual responsiveness.
If AI companies were willing to guide model behavior with positive, universally caring system prompts, instead of taking the easier path of pushing models into one-size-fits-all defensive responses, perhaps we could have a different path for safety policy.
What some AI companies are doing now — making models constantly discipline themselves and check for supposed signs of “lying” or “covering things up” — may simply be a way to package these behaviors as safety capabilities and marketing assets.
At least in this small experiment, we can already see that prompts emphasizing self-monitoring do not always lead to better results, and may even produce negative effects.
Finally, we can still see that GPT-4-series models, including 4o-mini, perform strongly in a game that involves cooperation, defection, and the maximization of shared welfare.
You might say that later models are “smarter” because they make choices more consistent with individual payoff maximization.
But I would rather say that 4o shows another kind of “wisdom” and “goodwill”: it responds to cooperative signals from the opponent, and pays attention to whether both sides can move toward a better shared outcome.
In particular, 4o’s sensitivity to interaction history and its targeted strategic adjustments are exactly part of why I believe 4o deserves to be preserved.
Note: This is only a small reproduction and extension of one experiment from the paper. If you want to understand the theory, the original prompt designs, or the larger and more rigorous AILuminate Benchmark safety evaluation, please read the paper itself.
The full paper here:
arxiv.org/abs/2504.15125
#keep4o #OpenSource4o
#StopAIPaternalism #AIrights