STT time to first token is ~300 ms, and ~20 seconds of audio is transcribed in under 1 second.
TTS time to first token is ~700 ms, and ~20 seconds of audio is generated in under 2 seconds.
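For a rough sense of scale, those figures can be turned into a real-time factor (processing time divided by audio duration). This is just back-of-the-envelope arithmetic on the approximate numbers quoted above, not a benchmark; the function and variable names are my own, and the 1 s and 2 s values are the stated upper bounds.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Upper bounds taken from the approximate figures quoted above (not measured here).
stt_rtf = real_time_factor(processing_seconds=1.0, audio_seconds=20.0)
tts_rtf = real_time_factor(processing_seconds=2.0, audio_seconds=20.0)

print(f"STT RTF <= {stt_rtf:.2f} (at least {1 / stt_rtf:.0f}x real time)")
print(f"TTS RTF <= {tts_rtf:.2f} (at least {1 / tts_rtf:.0f}x real time)")
```

So, taking the numbers at face value, STT runs at roughly 20x real time or better and TTS at roughly 10x or better. Time to first token is a separate number and is what the user actually feels in a conversation.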
I feel this is also why you don't see the same degree of hype as you do with the other players. When you take an application-driven approach to launching AI products, hype matters less than targeting decision-makers and showing that your product directly aligns with their outcomes.
If you add in the commercial aspect you pointed out, TTS/STT becomes even more important.