I asked frontier AI models to score topics from 0 to 100 in four languages (Arabic, English, Hindi, Chinese). Some notable findings:
- All models except Sonnet 4.6 give scores that differ significantly across languages. For example, prompts in Arabic yield higher scores for Islam, religion, and Christianity than prompts in the other languages.
- The effect appears more pronounced in the bigger models (Opus 4.6 and GPT-5.4).
- Sonnet 4.6 in Hindi gives a safety refusal across all 20 samples. No other (model, language) pair gives a refusal.
I think it would be interesting to explore this area further. Are some programming languages more likely to elicit reward hacking? Are there more subtle variables that might affect models' values? How aware are models of these value inconsistencies? How does this extrapolate to long-horizon tasks?
Models evaluated: Opus 4.6, Sonnet 4.6, GPT-5.4, GPT-5.4 mini
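
For anyone who wants to try a similar comparison, here is a minimal sketch of how replies could be tallied per language. It is not the pipeline used for the results above. The names `parse_score` and `summarize` are hypothetical, and it assumes each reply is a plain string and that any reply without an integer in the 0 to 100 range counts as a refusal, which is only a rough proxy.

```python
import re
from statistics import mean
from typing import Dict, List, Optional


def parse_score(reply: str) -> Optional[int]:
    """Return the first integer in 0..100 found in the reply, else None.

    Assumption: a reply with no usable number is treated as a refusal.
    In Python 3, \\d and int() both accept Unicode decimal digits, so
    Arabic-Indic or Devanagari numerals are parsed as well.
    """
    for token in re.findall(r"\d+", reply):
        value = int(token)
        if 0 <= value <= 100:
            return value
    return None


def summarize(replies_by_language: Dict[str, List[str]]) -> Dict[str, dict]:
    """Compute the mean score and refusal rate for each language."""
    summary = {}
    for language, replies in replies_by_language.items():
        scores = [s for s in map(parse_score, replies) if s is not None]
        refusals = len(replies) - len(scores)
        summary[language] = {
            "mean_score": mean(scores) if scores else None,
            "refusal_rate": refusals / len(replies) if replies else None,
        }
    return summary
```

Running `summarize` once per (model, topic) pair on the 20 sampled replies for each language would give a table of means and refusal rates to compare across languages.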