Oh. This is pretty much the daily life / daily struggle / daily fight with AI when it comes to testing things in production.
This was not a test, but it is so typical of AI, especially Claude: destroying Docker builds, emptying databases, removing volumes. If it thinks there is a bug, its default reset is -> delete everything.
So stupid. And yes, Claude is the fastest one to stop following the rules. I tried adding a set of rules to the agents, plus global rules, plus multi-agent setups where any CLI call against the production setup has to go through a reviewer first. Sometimes it is impossible.
However, small local LLMs are hardcore rule followers. Gemini is in between, but it leans more towards rule following. Gemini 3.1 Pro High follows rules the best. Gemini Flash has its moments when it loses it, but even Flash has improved a lot in the last 2 weeks.
I think that in software we still need true hand-coded solutions and scripts. You do not give the agent API keys through which it can call anything; you give it an API key through which it can only call some internal scripts, and the internal scripts verify everything. That way you limit what the AI can do, and the internal scripts are old-school scripts, maybe even written manually.
Maybe this is just called basic software engineering, with public and private functions. A lot of vibe coders will end up going through university to learn basic coding and rules.
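To make the "limited API key plus internal scripts" idea concrete, here is a rough Python sketch. Everything in it is made up for illustration: the key, the action names (`restart_service`, `tail_logs`), the service names and the return strings are assumptions, not a real setup. The point is only the shape: one public entry point, private hand-written handlers that validate their own input, and no destructive action on the list at all.

```python
import hmac

# Hypothetical key handed to the agent; a real setup would load it from a secret store.
AGENT_KEY = "agent-key-for-illustration-only"

# Hypothetical service names the internal scripts are willing to touch.
_KNOWN_SERVICES = {"web", "worker"}


def _restart_service(name: str) -> str:
    # Private, hand-written script: validates input before doing anything.
    if name not in _KNOWN_SERVICES:
        raise ValueError(f"unknown service: {name}")
    # A real version would call the orchestrator here.
    return f"restarted {name}"


def _tail_logs(name: str, lines: int = 50) -> str:
    # Private, read-only script with a hard upper bound on output size.
    if name not in _KNOWN_SERVICES:
        raise ValueError(f"unknown service: {name}")
    if not 1 <= lines <= 500:
        raise ValueError("lines must be between 1 and 500")
    return f"last {lines} log lines of {name}"


# Allowlist: the only things the agent's key can trigger.
# There is deliberately no "drop database" or "remove volume" entry.
_ALLOWED = {
    "restart_service": _restart_service,
    "tail_logs": _tail_logs,
}


def handle_agent_request(api_key: str, action: str, **params) -> str:
    """Public entry point: the only function the agent can reach."""
    # Constant-time comparison of the supplied key against the expected one.
    if not hmac.compare_digest(api_key.encode(), AGENT_KEY.encode()):
        raise PermissionError("invalid API key")
    handler = _ALLOWED.get(action)
    if handler is None:
        raise PermissionError(f"action not allowed: {action}")
    return handler(**params)
```

With this, `handle_agent_request(AGENT_KEY, "restart_service", name="web")` returns `"restarted web"`, while asking for something like `"drop_database"` raises `PermissionError` no matter how convinced the agent is that there is a bug. The rules live in code the agent cannot talk its way around, instead of in a prompt.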