Do benchmarks reflect that gap in the English-language region?
| Domain | Benchmark | OpenAI GPT-5.5 (xhigh) | OpenAI GPT-5.4 mini (xhigh) | Anthropic Opus 4.6 (max) | DeepSeek V4 Pro (max) |
|----------------------|------------------------|------------------------|-----------------------------|--------------------------|-----------------------|
| Cyber | CTF-Archive-Diamond | **71%** | 32% | 46% | 32% |
| Software Engineering | SWE-Bench Verified* | **81%** | 73% | 79% | 74% |
| | PortBench | **78%** | 41% | 60% | 44% |
| Natural Sciences | FrontierScience | **79%** | 74% | 72% | 74% |
| | GPQA-Diamond | **96%** | 87% | 91% | 90% |
| Abstract Reasoning | ARC-AGI-2 semi-private | **79%** | – | 63% | 46% |
| Mathematics | OTIS-AIME-2025 | **100%** | 90% | 92% | 97% |
| | PUMaC 2024 | **96%** | 93% | 95% | **96%** |
| | SMT 2025 | **99%** | 92% | 94% | 96% |
| **IRT-Estimated Elo** | – | **1260 ± 28** | 749 ± 46 | 999 ± 27 | 800 ± 28 |
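For context, the "IRT-Estimated Elo" row is an item-response-theory ability estimate reported on an Elo-like scale. I don't know CAISI's exact procedure; a minimal sketch, assuming the simplest (Rasch/1PL) model and the standard 400/ln(10) logit-to-Elo conversion, looks something like this (the function names, the 1000-point anchor, and the synthetic data are all illustrative, not CAISI's methodology):

```python
# Minimal Rasch (1PL) fit: P(model m solves item i) = sigmoid(theta_m - b_i).
# Abilities (theta) and difficulties (b) are fit by gradient ascent on the
# Bernoulli log-likelihood, then theta is mapped to an Elo-like scale.
import numpy as np

def fit_rasch(responses, n_iters=2000, lr=0.05):
    """responses: (n_models, n_items) array of 0/1 pass/fail outcomes."""
    n_models, n_items = responses.shape
    theta = np.zeros(n_models)   # model abilities (logits)
    b = np.zeros(n_items)        # item difficulties (logits)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = responses - p                   # d(log-likelihood)/d(logit)
        theta += lr * resid.sum(axis=1) / n_items
        b -= lr * resid.sum(axis=0) / n_models
        b -= b.mean()                           # pin the scale: mean difficulty = 0
    return theta, b

# Synthetic data: 4 hypothetical models answering 500 hypothetical items.
rng = np.random.default_rng(0)
true_theta = np.array([2.0, 0.5, 1.0, 0.6])
true_b = rng.normal(0, 1, size=500)
probs = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = (rng.random(probs.shape) < probs).astype(float)

theta, _ = fit_rasch(responses)
elo = 173.7 * theta + 1000   # 400/ln(10) ~= 173.7 logits-to-Elo; 1000 anchor is arbitrary
print(np.round(elo))
```

The fitted abilities are only identified up to an additive constant, hence the per-step centering of item difficulties and the arbitrary Elo anchor.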
Notably, two of the benchmarks with the biggest capability gaps are CAISI-internal/private ones (CTF-Archive-Diamond, PortBench) [1]. I read this as "DeepSeek is well-tuned for public benchmarks, and less generally intelligent than GPT-5.5 on held-out tasks," but a less charitable reading would be "the US government reports that US models do best on benchmarks that only the US government can run." Agent benchmarking is fraught with peril [2], and a nominally impartial benchmarker (one who disproportionately overlooks bugs/issues when evaluating certain models) can absolutely tilt the scales, so I would not be surprised if a PRC-led benchmarking of frontier models came to the opposite conclusion.
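As a quick sanity check on that split, here's a small tally of GPT-5.5's lead over DeepSeek from the table above. The held-out/public grouping is my own judgment call (I've counted ARC-AGI-2's semi-private set as held-out), not anything from the CAISI report:

```python
# Per-benchmark gap (GPT-5.5 minus DeepSeek V4 Pro), scores copied from the
# table above. The held_out/public labels are my reading, not CAISI's taxonomy.
scores = {
    # benchmark: (gpt55, deepseek, grouping)
    "CTF-Archive-Diamond":    (71, 32, "held_out"),
    "PortBench":              (78, 44, "held_out"),
    "ARC-AGI-2 semi-private": (79, 46, "held_out"),  # semi-private eval set
    "SWE-Bench Verified":     (81, 74, "public"),
    "FrontierScience":        (79, 74, "public"),
    "GPQA-Diamond":           (96, 90, "public"),
    "OTIS-AIME-2025":         (100, 97, "public"),
    "PUMaC 2024":             (96, 96, "public"),
    "SMT 2025":               (99, 96, "public"),
}

for group in ("held_out", "public"):
    gaps = [g - d for g, d, kind in scores.values() if kind == group]
    print(f"{group}: mean gap {sum(gaps) / len(gaps):.1f} pts over {len(gaps)} benchmarks")
```

On these numbers the mean gap is roughly 35 points on the held-out benchmarks versus about 4 on the public ones, which is exactly what makes the two readings above hard to distinguish.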
[1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...
[2] https://epoch.ai/gradient-updates/why-benchmarking-is-hard