I always felt CFG was a patch to fix a training problem we didn't yet understand.
Training only on normally distributed noise teaches the model that the error at each step will be perfectly normalized relative to the previous step, which is not the case during generation. As a result, it cannot correct the errors it introduced in earlier steps, and those uncorrected errors compound with each step, leading to distorted generations.
We currently work around this with two-pass CFG, which amplifies the model's predicted correction relative to a base (unconditional) prediction. That helps correct the errors at each step, but it also pushes the model to overcorrect, which produces oversaturated images. That classic AI look.
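For anyone unfamiliar with the two-pass step, here is a minimal sketch of the standard CFG combination. The function name `cfg_combine` and the toy arrays are made up for illustration; in a real pipeline the two predictions come from running the model once without and once with the prompt.

```python
import numpy as np

def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one.

    guidance_scale = 1.0 returns the conditional prediction unchanged;
    larger values amplify the difference between the two passes.
    """
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Toy stand-ins for the two model passes (hypothetical values).
pred_uncond = np.array([0.0, 1.0])
pred_cond = np.array([1.0, 3.0])

guided = cfg_combine(pred_uncond, pred_cond, guidance_scale=4.0)
print(guided)
```

With a scale above 1.0 the result lands outside the range spanned by the two predictions, which is one way to picture where the overcorrection comes from.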
I tested fine-tuning Z-Image with a balanced random augmentation applied to the noise, and it appears to have taught the model to overcome these errors. The model no longer needs CFG and also produces better-quality images.
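The exact augmentation is not spelled out here, so the following is only one possible reading, not the method actually used. It assumes "balanced" means perturbations that are symmetric around the unmodified noise: a per-sample scale drawn around 1.0 and a per-sample shift drawn around 0.0. The function name `augment_noise` and both default ranges are hypothetical.

```python
import numpy as np

def augment_noise(noise, rng, max_scale_dev=0.1, max_shift=0.05):
    """Perturb Gaussian training noise so it is no longer perfectly normalized.

    ASSUMPTION: symmetric per-sample scale and shift; the ranges are
    illustrative guesses, not values taken from the post.
    """
    batch = noise.shape[0]
    # One scale and one shift per sample, broadcast over the remaining dims.
    bshape = (batch,) + (1,) * (noise.ndim - 1)
    scale = rng.uniform(1.0 - max_scale_dev, 1.0 + max_scale_dev, size=bshape)
    shift = rng.uniform(-max_shift, max_shift, size=bshape)
    return noise * scale + shift

rng = np.random.default_rng(0)
noise = rng.standard_normal((2, 4, 8, 8))  # batch of 2 latent-shaped tensors
augmented = augment_noise(noise, rng)
print(augmented.shape, augmented.std(axis=(1, 2, 3)))
```

Because the perturbations are centered on the identity, the noise is still standard normal on average, but individual samples deviate from it, which is the kind of off-distribution input the model would see when its own errors accumulate.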
These samples are from training a LoRA on Z-Image with a batch size of 2 for 3,000 steps. I am going to do a significantly longer fine-tune using the same process this weekend.