Curious if BrowseComp accounts for domain-specific retrieval or if it's mostly general web search.
Mostly general web search - BrowseComp is a web browsing benchmark, not a knowledge test or a domain-specific retrieval suite. It evaluates whether AI agents can navigate the open web to find specific, obscure information across many topics.
Questions are “inverted” - authors start with a fact and work backwards to create a question that’s easy to verify but extremely hard to solve through search.
Brute-force search doesn’t work. The search space is deliberately massive - thousands of candidate papers, matches, or events - so systematic enumeration is impractical (quick sketch below).
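To make the find/verify asymmetry concrete, here’s a toy sketch in Python - the constraint names and values are invented, not taken from the benchmark. Each constraint alone matches thousands of items, so no single query narrows things down, yet checking a proposed answer is trivial:

```python
# Hypothetical "inverted" question - constraints are invented for illustration.
CONSTRAINTS = {
    "year_range": (2015, 2020),  # each constraint alone matches thousands of pages
    "venue": "NeurIPS",
    "author_count": 7,
}

def verify(paper: dict) -> bool:
    """Checking one candidate is trivial; producing a candidate is the hard part."""
    lo, hi = CONSTRAINTS["year_range"]
    return (
        lo <= paper["year"] <= hi
        and paper["venue"] == CONSTRAINTS["venue"]
        and paper["author_count"] == CONSTRAINTS["author_count"]
    )

print(verify({"year": 2018, "venue": "NeurIPS", "author_count": 7}))  # True
```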
Grading uses an LLM judge that checks each answer against the reference; the solving model also self-reports a confidence score, which is compared against graded accuracy to measure calibration - an interesting meta-layer where one model grades another and the graded model’s certainty is itself scored.
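Roughly what that grading loop looks like - a minimal sketch, where `call_llm` is a placeholder for whatever model client you use and the prompt is a paraphrase, not the benchmark’s actual grader template:

```python
import re

def call_llm(prompt: str) -> str:
    """Placeholder for your chat-completion client of choice."""
    raise NotImplementedError

def grade(question: str, reference_answer: str, response: str) -> dict:
    # Judge correctness by comparing the response to the reference answer.
    verdict = call_llm(
        "Judge whether the response matches the reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Response: {response}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    # The solver states its own confidence (e.g. "Confidence: 85%");
    # comparing stated confidence to graded accuracy measures calibration.
    match = re.search(r"[Cc]onfidence:\s*(\d+)", response)
    confidence = int(match.group(1)) if match else None
    return {
        "correct": verdict.strip().lower().startswith("correct"),
        "confidence": confidence,
    }
```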
This benchmark reveals the gap between “can answer questions” and “can do research” - the exact capability that separates chatbots from useful AI agents.