Let's goo! F5-TTS
> Trained on 100K hours of data
> Zero-shot voice cloning
> Speed control (based on total duration)
> Emotion-based synthesis
> Long-form synthesis
> Supports code-switching
> Best part: CC-BY license (commercially permissive) 🔥
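For the curious, here's roughly what "speed control based on total duration" can look like. This is a minimal sketch, not the project's actual code: the function name `estimate_total_frames`, the byte-length heuristic, and the proportional scaling are all assumptions made for illustration. The idea is that a non-autoregressive model has to be told the total output length up front, so changing that length changes the speaking rate.

```python
def estimate_total_frames(ref_frames: int, ref_text: str, gen_text: str,
                          speed: float = 1.0) -> int:
    """Hypothetical helper: guess the total number of output frames.

    Assumes speech length scales linearly with text length, using the
    reference clip to measure frames per byte of text.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    ref_len = len(ref_text.encode("utf-8"))
    gen_len = len(gen_text.encode("utf-8"))
    if ref_len == 0:
        raise ValueError("reference text must not be empty")

    # Speaking rate observed in the reference clip (frames per byte of text).
    frames_per_unit = ref_frames / ref_len

    # speed > 1 shrinks the generated part, speed < 1 stretches it.
    gen_frames = int(frames_per_unit * gen_len / speed)

    # Total duration covers the reference prompt plus the new speech.
    return ref_frames + gen_frames


if __name__ == "__main__":
    print(estimate_total_frames(100, "abcd", "abcdefgh", speed=1.0))
    print(estimate_total_frames(100, "abcd", "abcdefgh", speed=2.0))
```

Since the whole utterance is generated at once, squeezing the same text into fewer frames is what makes it sound faster.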
Diffusion-based architecture:
> Non-Autoregressive Flow Matching with DiT
> Uses ConvNeXt to refine the text representation, making it easier to align with speech
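To give a feel for what non-autoregressive flow matching means at sampling time, here's a toy sketch. It is not F5-TTS code: `toy_velocity` is a closed-form stand-in for the trained DiT (which would predict the velocity from text and reference audio), and `euler_sample` with its plain Euler steps is an assumption made for illustration. The point is that every frame starts as noise and all of them are moved toward the target together, rather than being generated one after another.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for the learned model: the exact velocity that carries x
    toward `target` along a straight-line path (valid for t < 1)."""
    return [(y - xi) / (1.0 - t) for xi, y in zip(x, target)]


def euler_sample(target, steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise, one value per output "frame".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = toy_velocity(x, t, target)
        # All positions are updated in parallel at each step.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0, 0.0]
    print(euler_sample(target, steps=8))
```

In a real system the number of steps trades quality against speed, and the velocity comes from the network instead of a formula that already knows the answer.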
Synthesised: I was, like, talking to my friend, and she's all, um, excited about her, uh, trip to Europe, and I'm just, like, so jealous, right? (Happy emotion)
The TTS scene is on fire!