We tested 13 speech-to-text providers on 100 real customer calls with:
- Background noise (vans, job sites, crying babies)
- UK and Irish regional accents (Northern/Southern England, Scotland, Ireland)
- Critical info: postcodes, addresses, phone numbers
- Variable turn length (1-5 words vs 16+)
Results: a 2.5x gap in WER between the best and worst providers
Best: Deepgram Flux (15.86% WER)
Worst: OpenAI Whisper (39.78% WER)
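For anyone unfamiliar: WER (word error rate) is word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference transcript. A minimal sketch of that calculation, illustrative only and not the exact scoring pipeline behind these numbers:

```python
# Minimal word-level WER: edit distance (subs + ins + dels) over reference length.
# Illustrative sketch only -- not the exact normalization/scoring used in the benchmark.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1] / max(len(ref), 1)

# "SW1A 1AA" heard as "SW1 A1A" plus "postcode" split in two: 4 edits on 5 words -> 0.8
print(wer("my postcode is SW1A 1AA", "my post code is SW1 A1A"))
```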
Interesting findings:
(1) Postcode recognition was hardest across ALL providers (50%+ WER); see the scoring sketch after this list.
(2) Regional variance was massive. Irish accents destroyed most models (20-30% higher WER than Southern England).
(3) Short confirmations ("yeah", "ok") actually had worse WER than long explanations. Counter-intuitive, but likely because short turns give the language model less context to work with.
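On (1): raw WER also punishes spacing and casing differences on postcodes ("sw1a 1aa" vs "SW1A1AA"). If you want to check whether the postcode itself survived transcription, one option is to extract and normalize it before comparing. A hedged sketch, not the metric behind the 50%+ figure above:

```python
import re

# Hedged sketch: pull a UK-style postcode out of a transcript and normalize
# spacing/casing so "sw1a 1aa" and "SW1A1AA" compare equal. Not the exact
# metric used in the benchmark -- just one way to score this field.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b", re.IGNORECASE)

def extract_postcode(text: str):
    m = POSTCODE_RE.search(text)
    return (m.group(1) + m.group(2)).upper() if m else None

print(extract_postcode("yeah it's sw1a 1aa"))             # SW1A1AA
print(extract_postcode("ess double u one ay one ay ay"))  # None: spelled-out letters need extra handling
```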
Full breakdown with graphs: https://x.com/pstrav/status/2018416957003866564
Context: We're Elyos AI (YC S23), handling 100k+ calls/month for trades businesses worldwide.