By freezing all model parameters except the value codes, we can continue learning under various distribution shifts. This is possible because model updates are localized and input-dependent: they do not affect predictions made from (key, value) pairs retrieved for dissimilar training samples.
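The locality argument can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the method's actual implementation: a single codebook, nearest-neighbor key retrieval, an identity decoder that outputs the retrieved value directly, a squared-error loss, and the hypothetical helper names `retrieve`, `predict`, and `update`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen keys (never updated) and trainable value codes.
# The sizes (8 keys, 4-D keys, 2-D values) are arbitrary choices.
keys = rng.normal(size=(8, 4))
values = np.zeros((8, 2))

def retrieve(x):
    """Return the index of the key nearest to the input encoding x."""
    return int(np.argmin(np.linalg.norm(keys - x, axis=1)))

def predict(x):
    """Assumed identity decoder: the prediction is the retrieved value."""
    return values[retrieve(x)]

def update(x, target, lr=0.5):
    """One gradient step on squared error; only the retrieved value changes."""
    i = retrieve(x)
    values[i] -= lr * 2.0 * (values[i] - target)
    return i

x_a = keys[0] + 0.01  # a sample that falls near key 0
x_b = keys[5] + 0.01  # a dissimilar sample that falls near key 5

target_a = np.array([1.0, -1.0])
before_b = predict(x_b).copy()
touched = update(x_a, target_a)
after_b = predict(x_b)
```

In this sketch the update for `x_a` writes only to the row of `values` selected by its key, so the prediction for `x_b`, which retrieves a different key, is left unchanged.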