Overview of our recently launched AA-WER Streaming benchmark, measuring streaming Speech to Text models on accuracy and latency for voice agent use cases
Streaming Speech to Text (STT) powers real-time transcription in voice agents and live captioning, where models must balance accuracy against speed. Fast transcripts keep responses feeling natural and free up the response-time budget for reasoning and tool calls. Accuracy matters too, since errors can compound downstream.
Streaming STT models transcribe audio as it is fed in, sharing outputs continuously, unlike offline (batch) models that process the entire file at once and are typically slower.
Models from Cartesia, ElevenLabs, and Deepgram sit on the accuracy-latency Pareto frontier. Cartesia Ink-2 leads on final transcript accuracy at 3.59% WER (210ms), closely followed by ElevenLabs Scribe v2 Realtime at 3.64% WER (140ms). Deepgram Flux is fastest at ~20ms on final transcript latency (7.36% WER).
In this video, Kiriill Butler, Member of Technical Staff at Artificial Analysis, walks through the benchmark and key results.