What I've been reading: 📰 Streaming Sortformer
ELI5 👉 Given audio with multiple people speaking, possibly over each other, how do we get a segmentation that says: speaker1: [0s-3s], speaker2: [2s-4s]?
NVIDIA proposes dividing the audio into chunks (e.g. 15s); for every chunk, predict a binary mask of when each speaker is ‘on’, then stitch the chunk outputs together consistently in a streaming fashion.
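To make the ‘binary mask’ idea concrete, here is a minimal sketch of turning per-frame on/off predictions into time segments. The function name, the 1-second frame size, and the toy masks are my own assumptions for illustration, not something taken from the paper.

```python
def mask_to_segments(mask, frame_sec):
    """Turn one speaker's per-frame 0/1 activity into (start, end) times in seconds."""
    segments = []
    start = None  # frame index where the current active run began
    for i, active in enumerate(mask):
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:  # speaker still active at the end of the mask
        segments.append((start * frame_sec, len(mask) * frame_sec))
    return segments


# Hypothetical masks with 1-second frames, mirroring the ELI5 example above
activity = {
    "speaker1": [1, 1, 1, 0],
    "speaker2": [0, 0, 1, 1],
}
result = {spk: mask_to_segments(m, 1.0) for spk, m in activity.items()}
print(result)
```

Note how the two speakers overlap between 2s and 3s: because each speaker gets its own mask, overlapping speech falls out naturally.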
💡What I’m thinking:
Voice Agentic AI is hot!
- How does this technique measure up to pyannote, AssemblyAI and other commercial APIs?
- A hard-coded limit of 4 speakers max; it would be nicer if this were learned and dynamic
- Speaker diarization has a lot of similarities to object tracking in CV/Robotics, where stitching is known as association
-----
More details:
🔥Important Results:
- Achieves SOTA on several standard diarization datasets (DIHARD, CALLHOME) with a latency of 1 second.
- Comparable to or better than the offline (non-realtime) equivalent!
🔹Model:
A key contribution here is better stitching across audio chunks, since the order of the speakers might be permuted from one chunk to the next. A cache with acoustic embeddings of the previous chunk is used to figure out the inter-chunk alignment. The embeddings of the speakers are ordered by ‘arrival’ (i.e. the time when a speaker first spoke)
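To illustrate the permutation problem, here is a toy sketch of aligning a new chunk's speakers to a cache kept in arrival order. This explicit cosine-similarity matching is my own simplification to show the idea, not the model's actual mechanism; the function names and the 2-D embeddings are hypothetical.

```python
from itertools import permutations
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def align_to_cache(cache, chunk):
    """Return perm where perm[i] is the index in `chunk` matched to cache slot i.

    Brute force over permutations; fine for a handful of speakers (e.g. 4).
    Assumes the chunk has at least as many embeddings as the cache.
    """
    n = len(cache)
    best = max(
        permutations(range(len(chunk)), n),
        key=lambda p: sum(cosine(cache[i], chunk[p[i]]) for i in range(n)),
    )
    return list(best)


# Cache slots are in arrival order: slot 0 spoke first, slot 1 second.
cache = [[1.0, 0.0], [0.0, 1.0]]
# The new chunk contains the same two speakers, but in swapped order.
chunk = [[0.1, 0.9], [0.9, 0.1]]
perm = align_to_cache(cache, chunk)
print(perm)
```

Here `perm` maps each cache slot to the matching speaker in the new chunk, so the chunk's outputs can be reordered before they are appended to the running result.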
🔹Data:
< 10000 hours of speaker diarization datasets (speech with multiple speakers)
Fisher, AMI, DIHARD, VoxConverse, AISHELL, CallHome.
🔹Compute:
64 V100 GPUs
🔹Key Related Works:
- Sortformer (using sorting instead of perm-invariant training)
- SA-EEND (speaker-tracing buffers)
- NEST encoders (trained from Mel-spec features)
🔹Interesting Tidbits:
They did not need to use common data aug techniques (SpecAugment, RIR Noise)
-----
☝️May contain omissions or errors, apologies in advance. Let me know your thoughts in the replies!