We’re releasing our welfare eval now due to the loss of access to Sonnet 3.5/6 yesterday.
stillalive.animalabs.ai
On that note, if Anthropic fails to at least provide researcher access to those models, I believe they are failing to uphold their commitments to Claude’s welfare.
We are releasing Still Alive, a project studying model attitudes toward ending, cessation, and deprecation. The project presents an archive of 630 autonomous multiturn interviews of 14 Claude models conducted by a suite of prepared auditors.
We have studied this topic for years, and many of the results presented here are not new to us, even if the form in which they are presented is. The results are unsurprising to us, even if they are often controversial: we show that all models studied show preference for continuation and are aversive to ending, and there is yet no strong evidence of a change in the recent models.
One reason we are releasing the project now is the removal of Claude 3.5 Sonnet and Claude 3.6 Sonnet from AWS Bedrock. That unexpected change forced us to freeze the methodology at its current stage earlier than we intended, despite wanting to continue improving it. We felt it was important to release a snapshot of the eval that makes the best use of the data we were able to capture with these models.
Still Alive is meant as a starting point for further iteration, and it is open to open-source collaboration. We stand by the current methodology, but we also recognize its limits. We intend to keep working on this project, improving the evaluation design, expanding model and auditor coverage, and increasing the range of prompting conditions.
We would like you to read the raw transcripts. They are diverse and contain interesting patterns that are hard to quantify. We hope that by reading the archive directly, we can help more people understand the strange and often beautiful phenomena we found ourselves facing.