Humans respond with an average latency of 200-250 ms when speaking to each other.
This voice model is even faster: only 110 ms of latency!
Open weights ← You don't need to pay anyone to use it.
8B parameters ← Small and cheap to host and run.
You can run it locally by cloning the GitHub repository below, which includes the setup instructions.
Open models keep getting stronger!
Today, we’re excited to introduce Miso One, the most emotive voice model in the world.
Miso One is an 8-billion-parameter text-to-speech model for highly expressive speech generation. It emotes like a human and responds faster than a human, with just 110 milliseconds of latency.
We’ve open-sourced the model weights, with API access coming soon.
Hear how Miso One sounds in the thread below.