One thing I've been exploring in the voice AI space is going beyond just speech output — making the voice agent actually interactive and able to navigate context on a website in real-time. For example, having the voice agent understand what page the user is on and guide them through actions (like booking a demo or finding pricing), not just read text aloud.
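To make the "page-aware" part concrete, here's roughly what I mean, as a minimal sketch (names like `PageContext` and `build_agent_prompt` are placeholders I made up, not any real framework):

```python
# Hypothetical sketch: attach the user's current page state to each agent turn
# so the voice agent can ground its guidance in what's actually on screen.
from dataclasses import dataclass

@dataclass
class PageContext:
    url: str
    title: str
    visible_actions: list[str]  # buttons/links the user can act on right now

def build_agent_prompt(user_utterance: str, ctx: PageContext) -> str:
    """Fold page state into the prompt so the agent can point at real UI elements."""
    return (
        f"User is on {ctx.url} ({ctx.title}).\n"
        f"Available actions: {', '.join(ctx.visible_actions)}.\n"
        f"User said: {user_utterance}\n"
        "Guide the user to the next concrete step on this page."
    )

prompt = build_agent_prompt(
    "I want to see pricing",
    PageContext(url="/product", title="Product", visible_actions=["Book a demo", "Pricing", "Docs"]),
)
```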
The latency challenge is real though. We found that keeping end-to-end response time under 500ms requires a pipeline approach: streaming STT → intent classification → pre-cached response segments → TTS. Pre-buffering even 200ms of audio before playback starts makes a huge perceptual difference.
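For illustration, here's a rough sketch of that pipeline shape with the ~200ms pre-buffer (all the STT/intent/TTS stages are stand-ins, not a real API, and the constants are just examples):

```python
# Sketch, not production code: classify intent from partial transcripts, serve
# pre-cached audio for common intents, and pre-buffer ~200 ms of PCM before
# starting playback so the synthesized output doesn't stutter.
import asyncio
from typing import Optional

SAMPLE_RATE = 16_000
PREBUFFER_MS = 200
PREBUFFER_BYTES = SAMPLE_RATE * 2 * PREBUFFER_MS // 1000  # 16-bit mono PCM

CACHED_SEGMENTS = {  # hypothetical pre-rendered audio for common intents
    "book_demo": b"\x00" * 8000,
    "pricing": b"\x00" * 8000,
}

async def classify_intent(partial_transcript: str) -> Optional[str]:
    # Placeholder: a real classifier would run continuously on partial STT output.
    if "demo" in partial_transcript:
        return "book_demo"
    if "pric" in partial_transcript:
        return "pricing"
    return None

async def synthesize(text: str):
    # Placeholder TTS: yield small PCM chunks as they are "generated".
    for _ in range(10):
        await asyncio.sleep(0.02)
        yield b"\x00" * 1600  # ~50 ms of audio per chunk

async def respond(partial_transcript: str, play_chunk) -> None:
    intent = await classify_intent(partial_transcript)
    if intent in CACHED_SEGMENTS:
        await play_chunk(CACHED_SEGMENTS[intent])  # cached path: near-zero synthesis latency
        return
    buffered, started = bytearray(), False
    async for chunk in synthesize(partial_transcript):
        buffered.extend(chunk)
        if not started and len(buffered) >= PREBUFFER_BYTES:
            started = True  # only start playback once ~200 ms is queued
        if started:
            await play_chunk(bytes(buffered))
            buffered.clear()
    if buffered:  # flush the tail for short responses
        await play_chunk(bytes(buffered))

async def main():
    async def play_chunk(pcm: bytes):
        print(f"play {len(pcm)} bytes")
    await respond("can I book a demo", play_chunk)       # cached-segment path
    await respond("tell me about the product", play_chunk)  # streamed TTS path

asyncio.run(main())
```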
What's your latency like with the MCP speak tool? And are you thinking about bidirectional voice (listening + speaking) or mainly output?