You might think speech recognition is "solved" with models such as @OpenAI’s Whisper, but it's not. Transcribing natural conversations captured by distant microphones still lacks an effective solution.
To illustrate, on our newly released NOTSOFAR meeting benchmark, Whisper large-v3 achieves 9.3% WER (word error rate) on audio from head-mounted mics, yet on audio from a distant mic it climbs to 37.4% WER, roughly four times higher. The culprits are reverberation, noise, and overlapping speech, all of which corrupt the source signal.
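For readers new to the metric, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the plain whitespace tokenization are assumptions for illustration; this is not the benchmark's official scoring code, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis contains many extra words.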
What's the missing ingredient? We believe it's datasets.
The problem is not amenable to web scraping. Benchmarking datasets are scarce given their complex collection process. Microphone arrays, useful for speech separation, are rarely featured in labeled datasets, necessitating simulation to teach neural networks to utilize such arrays.
To bridge the gap, our team at @Microsoft has released a benchmarking dataset of 280 recorded meetings, and a 1000-hour simulated training set synthesized for real-world generalization.
Join our challenge "NOTSOFAR: Distant Meeting Transcription with a Single Device", part of CHiME-8, to explore these resources and advance the field.
Details and registration:
aka.ms/chime8
Code and datasets:
aka.ms/notsofar