I’m trying to build this from first principles instead of focusing on any specific framework.
My goal is test effectiveness rather than test density — what you assert, where you assert it, and how fast it runs. I’m trying to provide enough data to agents so they can generate stronger assertions, think like testers, and actively try to break the app.
Regarding test coverage, a Reddit user suggested supplementing my approach with golden testing alongside traditional test cases. I’ve never used golden testing before, and at first it feels like a brute-force approach. I know that Flutter and Playwright support visual testing via pixel-by-pixel comparison.
In our case, golden testing would mean saving “known-good” outputs from a working version of the app — such as API responses, rendered UI snapshots/screenshots, logs, or database records — and, on every new change, re-running and diffing against those baselines. If something changes unexpectedly, you catch the regression even if you didn’t write an explicit test for that edge case.
Screenshots can also be processed through a vision-language model (VLM) to generate detailed descriptions of the app, UI elements, core features, and priority user actions. This could help detect meaningful state changes triggered by events — changes that simple pixel-by-pixel or frame-by-frame comparisons might miss.