The request path: Client (laptop on Tailscale) → Tailscale Aperture (AI gateway — auth + routes by model name) → llama-swap → vLLM → GPU
What I like about it: - Access runs over Tailscale, so it's end-to-end encrypted and gated by OAuth. No open ports and no reverse proxy to babysit.
- llama-swap loads models on demand: if the requested model isn't running, it starts a vLLM child process, and if a model sits idle for ~5 min, it kills it to free VRAM. Useful when juggling models on one box.
- vLLM handles inference (currently Qwen3.6 27B).
I can also just SSH in to work directly on the GPU — adding models, fine-tuning, and so on.