We’re open-sourcing the Go orchestrator we built at Lokutor (https://github.com/lokutor-ai/lokutor-orchestrator).
Building a voice agent that feels like a human is 20% model quality and 80% orchestration. The "standard" approach—daisy-chaining STT, LLM, and TTS APIs—usually results in a 2-3 second delay that kills the conversation. We also found that implementing "Barge-in" (the ability to interrupt the bot) is surprisingly tricky to get right across multiple streaming providers.
We chose Go because voice orchestration is essentially a high-concurrency plumbing problem. You're juggling several bidirectional streams (WebSockets/gRPC) while computing RMS for VAD (Voice Activity Detection) and driving a state machine that has to react within milliseconds when it detects user speech.
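To give a sense of the VAD hot path, here's a minimal sketch of RMS over 16-bit little-endian PCM frames. The function names and the threshold parameter are illustrative, not the repo's actual API:

    // Minimal sketch of RMS-based VAD over 16-bit little-endian PCM frames.
    package vad

    import (
        "encoding/binary"
        "math"
    )

    // RMS returns the root-mean-square amplitude of a 16-bit LE PCM frame.
    func RMS(frame []byte) float64 {
        n := len(frame) / 2
        if n == 0 {
            return 0
        }
        var sum float64
        for i := 0; i < n; i++ {
            s := float64(int16(binary.LittleEndian.Uint16(frame[2*i:])))
            sum += s * s
        }
        return math.Sqrt(sum / float64(n))
    }

    // IsSpeech reports whether a frame crosses the speech threshold
    // (the threshold value itself is tuning-dependent).
    func IsSpeech(frame []byte, threshold float64) bool {
        return RMS(frame) >= threshold
    }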
What’s inside:
- Full-Duplex: capture and playback occur simultaneously without audio feedback loops.
- Native Barge-in: when the user starts speaking, the orchestrator immediately cancels the LLM generation and clears the TTS audio buffers (see the sketch below).
- Built-in RMS VAD: thread-safe voice activity detection out of the box.
- Provider Agnostic: swap between Groq, OpenAI, Deepgram, Anthropic, and our own Versa engine.
- Minimal Latency: designed to add <10ms of overhead on top of provider latencies.

We've used this to build agents that achieve sub-500ms end-to-end response times. We'd love your feedback on the architecture, especially on how we handle the ManagedStream state machine.
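On barge-in specifically, the core pattern is roughly the following: cancel the in-flight LLM/TTS work via context and drop any queued playback audio. This is a simplified sketch, not the actual ManagedStream implementation; Turn, StartTurn, BargeIn, and playback are illustrative names:

    // Hedged sketch of the barge-in pattern: cancel in-flight LLM/TTS
    // generation via context, then drain the playback queue so the bot
    // goes silent immediately.
    package orchestrator

    import "context"

    // Turn holds the cancellation handle and audio queue for one bot response.
    type Turn struct {
        cancel   context.CancelFunc
        playback chan []byte // TTS audio chunks awaiting playback
    }

    // StartTurn begins a bot response that can be interrupted at any time.
    func StartTurn(parent context.Context, buf int) (*Turn, context.Context) {
        ctx, cancel := context.WithCancel(parent)
        return &Turn{cancel: cancel, playback: make(chan []byte, buf)}, ctx
    }

    // BargeIn is called the moment VAD detects user speech: it stops
    // generation (downstream streaming loops must select on ctx.Done())
    // and discards whatever audio was already queued for playback.
    func (t *Turn) BargeIn() {
        t.cancel()
        for {
            select {
            case <-t.playback: // drop a queued chunk
            default:
                return
            }
        }
    }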
GitHub: https://github.com/lokutor-ai/lokutor-orchestrator
Docs: https://pkg.go.dev/github.com/lokutor-ai/lokutor-orchestrato...