The Voice Agent chains three models in the browser: Whisper for STT → a local LLM → Kokoro/SpeechT5 for TTS. All inference runs on-device via WebGPU. The latency isn't amazing yet, but the fact that it works at all with zero backend is kind of wild.
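For anyone curious, the chain is conceptually just three async stages composed in order. Here's a minimal sketch of that wiring — the `voiceTurn` name and the stage stubs are mine, not the project's; in a real browser build each stage would be something like a Transformers.js pipeline created with `device: 'webgpu'`:

```typescript
// Each stage is an async transform; the real models (Whisper, the LLM,
// Kokoro/SpeechT5) just need to fit these shapes.
type STT = (audio: Float32Array) => Promise<string>;        // speech -> text
type LLM = (prompt: string) => Promise<string>;             // text -> text
type TTS = (text: string) => Promise<Float32Array>;         // text -> speech

// One conversational turn: transcribe, generate a reply, synthesize it.
async function voiceTurn(
  audio: Float32Array,
  stt: STT,
  llm: LLM,
  tts: TTS,
): Promise<Float32Array> {
  const transcript = await stt(audio);
  const reply = await llm(transcript);
  return tts(reply);
}
```

The nice property of keeping the stages this decoupled is that you can swap TTS backends (Kokoro vs. SpeechT5) without touching the rest of the chain.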
The landing page has an auto-playing demo that generates speech locally as soon as you visit — you'll hear it type out and speak three sentences. That was important to me because "runs in your browser" sounds like marketing until you actually hear it happen.
Happy to go deep on the WebGPU inference pipeline, model conversion process, or anything else.