Instead of stuffing raw documents into every call, you index your data once and query it with simple prompts like “how does X work?” to get grounded, cited answers from your own data. Your main agent can also delegate low-level RAG questions to a smaller local model for token efficiency, while a stronger frontier model handles higher-level reasoning.
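The delegation pattern above can be sketched as a simple router: cheap, grounded lookups go to the local model, everything else to the frontier model. This is an illustrative sketch, not qi's actual API; the `ask_local` / `ask_frontier` helpers and the keyword heuristic are hypothetical stand-ins for real model calls.

```python
# Hypothetical sketch of main-agent delegation: low-level RAG questions
# go to a small local model, higher-level reasoning to a frontier model.
# The helpers and routing heuristic are illustrative, not qi's API.

RETRIEVAL_HINTS = ("how does", "where is", "what file", "find", "look up")

def ask_local(question: str) -> str:
    # Stand-in for a call to a small local model (e.g. served by Ollama).
    return f"[local answer to: {question}]"

def ask_frontier(question: str) -> str:
    # Stand-in for a call to a stronger frontier model.
    return f"[frontier answer to: {question}]"

def route(question: str) -> str:
    """Send grounded lookups to the local model, reasoning to the frontier one."""
    q = question.lower()
    if any(hint in q for hint in RETRIEVAL_HINTS):
        return ask_local(question)
    return ask_frontier(question)
```

In a real setup the heuristic would be replaced by the orchestrating model's own tool-use decision, but the shape is the same: the frontier model never sees the raw documents, only the local model's distilled answer.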
That makes it a good fit for setups that pair a local model such as Gemma 3 with a more capable orchestration model. Token usage goes down, latency improves, and the whole system becomes more efficient. qi can also run fully offline, so you keep full control over your data, models, and infrastructure.
You can plug in whatever model stack you prefer, whether that is Ollama, LM Studio, llama.cpp, MLX, or cloud APIs, which makes it easy to balance cost, speed, and quality. It also integrates cleanly into agent workflows, including as a Claude Code plugin, so SOTA models can delegate retrieval and lightweight knowledge queries instead of wasting context.
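Swapping model stacks is practical because most local runtimes (Ollama, LM Studio, llama.cpp's server) expose an OpenAI-compatible HTTP endpoint, so changing backends is largely a matter of changing a base URL. A minimal sketch, assuming OpenAI-compatible endpoints; the URLs below are the runtimes' common defaults, not qi's configuration:

```python
# Sketch of backend selection behind one interface. The base URLs are
# the usual defaults for each runtime's OpenAI-compatible server;
# this mapping is illustrative, not qi's config format.

BACKENDS = {
    "ollama":   "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "llamacpp": "http://localhost:8080/v1",
    "cloud":    "https://api.openai.com/v1",
}

def base_url(backend: str) -> str:
    """Resolve a backend name to its API base URL."""
    try:
        return BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend}") from None
```

Because the endpoints share one API shape, the same client code can balance cost, speed, and quality just by pointing at a different backend.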