- Similar to OpenRouter, we measure the latency of the different providers to ensure we always get the fastest results
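A minimal sketch of that idea: keep an exponential moving average of each provider's latency and route to the fastest. The class, provider names, and smoothing factor here are my own assumptions, not the actual implementation.

```python
from collections import defaultdict

class LatencyRouter:
    """Hypothetical latency-based router: tracks an exponential moving
    average (EMA) of per-provider latency and picks the fastest one."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha                      # EMA smoothing factor (assumed)
        self.ema = defaultdict(lambda: None)    # provider -> EMA latency in seconds

    def record(self, provider: str, latency_s: float) -> None:
        """Fold a new observed latency into the provider's EMA."""
        prev = self.ema[provider]
        self.ema[provider] = latency_s if prev is None else (
            self.alpha * latency_s + (1 - self.alpha) * prev
        )

    def fastest(self, providers: list[str]) -> str:
        """Pick the lowest-EMA provider; unmeasured ones sort first
        so every provider gets probed at least once."""
        return min(providers,
                   key=lambda p: (self.ema[p] is not None, self.ema[p] or 0.0))
```

The EMA keeps the router responsive to provider slowdowns without overreacting to one slow request.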
- Users can cancel a single model stream if it's taking too long
- The orchestrator is pretty good at choosing which models to use for which task. The actual confidence scoring and synthesis at the end is the difficult part that you can't do naively; the orchestrator, however, plays the biggest part in optimizing cost and speed. I've made sure that the vast majority of queries don't exceed 25% extra cost or time compared to equivalent prompts in ChatGPT/Gemini/etc.
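One way to picture that 25% bound: before running a multi-model plan, compare its estimated cost and latency against a single-model baseline. This guard function is a sketch under my own assumptions; the real orchestrator's budgeting logic isn't shown in the post.

```python
# Assumed budget factor matching the "25% extra" claim from the post.
BUDGET_FACTOR = 1.25

def within_budget(plan_cost: float, plan_latency: float,
                  baseline_cost: float, baseline_latency: float) -> bool:
    """Hypothetical guard: accept a multi-model plan only if its estimated
    cost AND latency stay within 25% of a single-model baseline."""
    return (plan_cost <= BUDGET_FACTOR * baseline_cost
            and plan_latency <= BUDGET_FACTOR * baseline_latency)
```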
This is viable, IMO, because you can run multiple less-intelligent models at lower thinking efforts and beat a single more-intelligent model at a high thinking effort. Reducing the thinking effort speeds up each prompt dramatically.
The sequential steps are then:
1. Ensemble RAG
2. Orchestrator
3. Models in parallel
4. Synthesizer
Plus retries for low-confidence answers (though that's already fairly optimized via selective retries of portions of the answer).
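The steps above can be sketched end to end. Every callable here is a hypothetical stand-in for the real component, and the confidence threshold is an assumption:

```python
def answer(query, retrieve, orchestrate, call_model, synthesize,
           confidence, retry_portion, threshold=0.7):
    """Sequential pipeline sketch: RAG -> orchestrator -> models ->
    synthesizer, with a selective retry when confidence is low.
    All callables are hypothetical stand-ins; `threshold` is assumed."""
    context = retrieve(query)                # 1. ensemble RAG
    models = orchestrate(query, context)     # 2. pick models for this task
    drafts = [call_model(m, query, context)  # 3. run in parallel in practice
              for m in models]
    result = synthesize(drafts)              # 4. merge + confidence scoring
    if confidence(result) < threshold:
        # Selective retry: re-run only the weak portion, not the whole query.
        result = retry_portion(result)
    return result
```

The key cost lever is step 2: the orchestrator decides how many models (and at what effort) the query actually needs, so easy queries stay cheap.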