Wow. This Nature paper is incredibly interesting and worrying.
The authors show that LLMs may subliminally pass their traits on to other models.
For example, if an LLM likes owls and is used to generate training data for another model, then the other model may also end up liking owls.
This happens subliminally, meaning that it can persist even after semantically related mentions are carefully removed from the training data.
They also find a similar effect with misaligned behaviour, raising the concern that unsafe tendencies may be passed from model to model in ways that are hard for humans to detect.
The effect seems strongest when teacher and student share the same, or a behaviourally matched, base model.
This raises a rather disturbing possibility: models may inherit hidden traits from other models, even when those traits are not explicitly visible in the training data.
Paper in the first reply