The write-up covers why VAD alone fails for real turn detection, how voice agents reduce to a minimal speaking/listening loop, why STT → LLM → TTS must be streaming rather than sequential, why TTFT matters more than model quality in voice, and why geography dominates latency. By colocating Twilio, Deepgram, ElevenLabs, and the orchestration layer, I reached ~790ms end-to-end latency, slightly faster than an equivalent Vapi setup.