Voice used to be AI’s forgotten modality - now it's having its big moment: rapid innovation, big funding rounds, major agentic applications
My conversation with @neilzegh, top AI researcher in the field (@GoogleDeepMind, @Meta, @kyutai_labs) and now CEO of @GradiumAI
This is a reference episode on all things voice AI 🔥
00:00 Intro
01:21 Voice AI’s big moment, and why we’re still early
03:34 Why voice lagged behind text/image/video
06:06 The convergence era: transformers for every modality
07:40 Beyond Her: always-on assistants, wake words, voice-first devices
11:01 Voice vs text: where voice fits (even for coding)
12:56 Neil’s origin story: from finance to machine learning, with help from @ylecun and @soumithchintala
18:35 Neural codecs (SoundStream): compression as the unlock
22:30 Kyutai: open research, small elite teams, moving fast
31:32 Why big labs haven’t “won” voice AI
34:01 On-device voice: where it works, why compact models matter
46:37 The last mile: real-world robustness, pronunciation, uptime
41:35 Benchmarking voice: why metrics fail, how they actually test
47:03 Cascades vs speech-to-speech: trade-offs & what’s next
54:05 Hardest frontier: noisy rooms, factories, multi-speaker chaos
1:00:50 New languages & dialects: what transfers, what doesn’t
1:02:54 Hardware & compute: why voice isn’t a 10,000-GPU game
1:07:27 What data do you need to train voice models
1:09:02 Deepfakes & privacy: why watermarking isn’t a solution
1:12:30 Voice & vision: multimodality, screen awareness, video & audio
1:14:43 Voice cloning vs voice design: where the market goes
1:16:32 Paris/Europe AI: talent density, underdog energy, what’s next