Across 10K+ of our agent transcripts from benchmarking against OpenAI's EVMBench, we saw zero refusals. In the closed-frontier models, the refusal you hit is mostly a separate content classifier, or a system prompt, not so much the model itself. Breadth (more cheap agents) beats a bigger model, but it puts more requirements on context engineering