Why does a quiet recording matter? Not for aesthetics. For math.
A speech model learns by separating "the signal" (your voice) from "everything else" (background noise). If the noise overwhelms the signal, the model can't isolate the words. If the noise is unusual, like a TV at full volume, music, or an air-conditioning unit, the model learns to associate your language with that noise too. Then, in the real world, it fails on anyone whose room does not sound like that, including people recording somewhere quiet.
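To put a number on "noise overwhelms the signal," here is a minimal sketch of a signal-to-noise ratio (SNR) calculation in decibels. Everything in it is an assumption made for illustration: the function names `rms` and `snr_db` are hypothetical, the "voice" and "room" are synthetic sine tones rather than real recordings, and nothing here is meant to describe how any particular speech model measures noise.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 20 * log10(rms(speech) / rms(noise))."""
    return 20 * math.log10(rms(speech) / rms(noise))

# Synthetic stand-ins (assumptions, not real recordings):
# a 220 Hz tone plays the voice, a 50 Hz hum plays the room.
rate = 16000  # samples per second, one second of audio
voice = [0.5 * math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
quiet_room = [0.005 * math.sin(2 * math.pi * 50 * n / rate) for n in range(rate)]
loud_room = [0.5 * math.sin(2 * math.pi * 50 * n / rate) for n in range(rate)]

quiet_snr = snr_db(voice, quiet_room)  # voice is 100x the hum: about 40 dB
loud_snr = snr_db(voice, loud_room)    # voice and hum are equal: about 0 dB

print(f"quiet room: {quiet_snr:.1f} dB, loud room: {loud_snr:.1f} dB")
```

At around 0 dB the noise is as strong as the voice, which is the "overwhelmed" case described above; the higher the number, the easier the words are to isolate.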
Same logic for articulation. If you over-pronounce (speaking slowly, carefully, projecting), the model learns "this is what this language sounds like" and then can't handle a normal conversational pace.
The single most useful rule: record the way you actually speak, in a room that sounds the way your normal rooms sound. The model needs to learn your real voice, not a performance of it.