OpenAI is expected to demo a real-time voice assistant tomorrow. What does it take to deliver an immersive, or even magical, experience?
Almost all voice AI systems go through 3 stages:
1. Speech recognition or "ASR": audio -> text1, think Whisper;
2. LLM that plans what to say next: text1 -> text2;
3. Speech synthesis or "TTS": text2 -> audio, think ElevenLabs or VALL-E.
Last year, I made the figure below to show how to make Siri/Alexa 10x better. However, naively going through the 3 stages one after another results in huge latency. User experience falls off a cliff if we have to wait 5 seconds for *each* reply. It breaks the immersion and feels lifeless even if the synthesized audio itself sounds real.
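To make the latency problem concrete, here is a toy sketch of the naive cascade. The stage functions are string-building stand-ins, not real Whisper/LLM/TTS calls, and the per-stage latencies are made-up numbers chosen only so they add up to the 5 seconds mentioned above. The point is just that in a strictly sequential design, the user waits for the sum of all stages.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    latency_ms: int            # hypothetical latency, not a measurement
    run: Callable[[str], str]  # stand-in for the real model call


def run_cascade(stages: List[Stage], audio_in: str) -> Tuple[str, int]:
    """Run the stages strictly one after another.

    Returns the final output and the total wait, which is simply
    the sum of every stage's latency.
    """
    data = audio_in
    total_ms = 0
    for stage in stages:
        data = stage.run(data)
        total_ms += stage.latency_ms
    return data, total_ms


# Assumed numbers for illustration only.
stages = [
    Stage("ASR", 1000, lambda audio: f"text1({audio})"),
    Stage("LLM", 3000, lambda text1: f"text2({text1})"),
    Stage("TTS", 1000, lambda text2: f"audio({text2})"),
]

reply, wait_ms = run_cascade(stages, "user_audio")
print(reply, wait_ms)  # audio(text2(text1(user_audio))) 5000
```

Making any single stage faster shrinks one term of the sum, but the user still hears nothing until the last stage has finished.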
Natural dialogues fundamentally don't work like this. We humans
> think about what to say next at the same time as we listen & speak;
> inject "yes, hmm, huh" at appropriate moments;
> predict when the other person finishes and immediately take over;
> decide to talk over the other person organically, without being offensive;
> handle interruptions gracefully. Currently, AI assistants either cannot be interrupted (super frustrating) or simply stop when they detect an audio event and lose their train of thought;
> engage in group chats. We are so good at multi-agent conversations.
It's not as simple as making each of the 3 neural nets faster, sequentially. Solving real-time dialogue requires us to rethink the whole stack, overlap each component as much as possible, and learn how to make interventions in real time.
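A rough way to see why overlap matters: the toy schedule below streams the work through the 3 stages in small chunks, so each stage starts on a chunk as soon as the previous stage emits it. Everything here is an assumption for illustration: the per-chunk costs (hypothetical 1 s / 3 s / 1 s stage totals split into 10 equal chunks), the chunk count, and the simplification that the full user utterance is already available at time 0. It only simulates timing; it does not run any models.

```python
from typing import List


def pipelined_finish_times(chunk_cost_ms: List[int], n_chunks: int) -> List[List[int]]:
    """finish[s][c] = time (ms) at which stage s finishes chunk c."""
    finish = [[0] * n_chunks for _ in chunk_cost_ms]
    for s, cost in enumerate(chunk_cost_ms):
        for c in range(n_chunks):
            # A stage can start chunk c once the previous stage has produced it...
            input_ready = finish[s - 1][c] if s > 0 else 0
            # ...and once the stage itself is done with chunk c - 1.
            stage_free = finish[s][c - 1] if c > 0 else 0
            finish[s][c] = max(input_ready, stage_free) + cost
    return finish


# Hypothetical per-chunk costs for ASR, LLM, TTS (not measurements).
chunk_cost_ms = [100, 300, 100]
n_chunks = 10

finish = pipelined_finish_times(chunk_cost_ms, n_chunks)
first_audio_ms = finish[-1][0]    # when the user first hears something
last_audio_ms = finish[-1][-1]    # when the whole reply has been synthesized
sequential_ms = sum(chunk_cost_ms) * n_chunks  # no overlap at all

print(first_audio_ms, last_audio_ms, sequential_ms)  # 500 3200 5000
```

Under these assumptions the first audio arrives after 500 ms instead of 5000 ms, even though no individual stage got faster. The slowest stage still bounds the total time, and none of this addresses interruptions or turn-taking, which is why overlap alone is not the whole answer.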
Or perhaps even better - just have 1 NN mapping audio to audio. End-to-end always wins.
I'll sketch out how to design such a model and its training pipeline. Meanwhile, let's wait and see how far OpenAI pushes it!