mckennameyer 7 hours ago
We tested GPT-5 and Gemini Flash 3 at low, medium, and high effort on 169 instances with human-verified answers, scored against a frozen offline web corpus using Deep Research Bench. For both models, high effort consistently scored worse than low and medium effort. Methodology and raw data: https://everyrow.io/docs/notebooks/deep-research-bench-paret...