When an LLM writes a proof in natural language that a mathematician has to check, it's calling the mathematician as a tool. So if you haven't checked the output, that's not a "proof". It's a bug in your harness, just like if your LLM failed to properly call the Lean kernel.