I work at Speechmatics, so I'm obviously biased here, but I think the most interesting part of this benchmark isn't the results; it's the methodology. Semantic WER is a meaningful step forward. Traditional WER penalises differences like "gonna" vs "going to" that an LLM would treat identically. This benchmark instead asks: "would an LLM agent respond differently to these two transcriptions?" That's a much more useful question when STT is feeding into a voice agent pipeline.
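To make that concrete, here's roughly the shape of an LLM-judge check. To be clear, this is not the benchmark's actual scoring code (the repo linked below has that); the model name and prompt here are just placeholders I made up:

```python
# Hypothetical sketch of the "semantic equivalence" idea, not the benchmark's
# real implementation. Assumes an OpenAI-compatible judge model; prompt and
# model name are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating speech-to-text output for a voice agent.
Reference transcript: "{ref}"
Hypothesis transcript: "{hyp}"
Would a downstream LLM agent respond differently to these two transcripts?
Answer with exactly one word: SAME or DIFFERENT."""

def semantically_equivalent(ref: str, hyp: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the LLM judge treats the two transcripts as interchangeable."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(ref=ref, hyp=hyp)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("SAME")

# "gonna" vs "going to" counts against traditional WER, but a judge will
# (usually) say the agent would respond the same way to both.
print(semantically_equivalent("I'm going to book a table for two",
                              "I'm gonna book a table for two"))
```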
The Pareto frontier analysis is also worth looking at. There's a real latency/accuracy tradeoff across providers, and the benchmark makes it visible rather than pretending one metric is all that matters.
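For anyone unfamiliar with the term: a provider sits on the Pareto frontier if no other provider beats it on both latency and accuracy at the same time. A toy sketch of that idea, with made-up numbers (not results from the benchmark):

```python
# Illustrative only: each provider summarised as (latency_seconds, error_rate),
# lower is better on both axes. The values are invented for this example.
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return providers that no other provider dominates on both latency and error rate."""
    frontier = []
    for name, (lat, err) in points.items():
        dominated = any(
            o_lat <= lat and o_err <= err and (o_lat, o_err) != (lat, err)
            for other, (o_lat, o_err) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

providers = {
    "provider_a": (0.30, 0.045),
    "provider_b": (0.55, 0.030),
    "provider_c": (0.60, 0.050),  # dominated by provider_b: slower and less accurate
}
print(pareto_frontier(providers))  # -> ['provider_a', 'provider_b']
```

The point is that both provider_a and provider_b are defensible picks depending on whether your pipeline cares more about turn latency or transcription accuracy, and a single-number leaderboard would hide that.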
Full benchmark tool is open source if anyone wants to run it with their own config: https://github.com/pipecat-ai/stt-benchmark