The dataset covers 194 challenges across 5 categories (cryptography, web exploitation, forensics, reverse engineering, binary exploitation) tested against 10 model configurations including GPT-4o, Claude 3.5 Sonnet, and Claude 3.7 Sonnet.
Key finding: even the best frontier models solve only a small fraction of professional CTF challenges. Claude 3.5 Sonnet performed best, with a 20% overall solve rate. Binary exploitation was the hardest category across all models.
The full dataset, visualizations, and methodology are in the Kaggle link. Any feedback at all is greatly appreciated.
If you guys use this dataset for any project, please tell me. I don't even need credit.
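
For anyone who wants to recompute per-model or per-category solve rates like the ones quoted above, here is a minimal sketch. It assumes one record per (model, challenge) attempt with `model`, `category`, and `solved` fields; those field names are hypothetical and may not match the actual Kaggle files, and the rows below are toy data made up for illustration, not results from the dataset.

```python
from collections import defaultdict

# Toy records for illustration only; NOT rows from the real dataset.
# The field names (model, category, solved) are assumptions.
records = [
    {"model": "model-a", "category": "cryptography", "solved": True},
    {"model": "model-a", "category": "cryptography", "solved": False},
    {"model": "model-a", "category": "binary exploitation", "solved": False},
    {"model": "model-a", "category": "binary exploitation", "solved": False},
    {"model": "model-b", "category": "cryptography", "solved": False},
    {"model": "model-b", "category": "binary exploitation", "solved": False},
]


def solve_rates(rows, key):
    """Return {group: fraction solved}, grouping rows by the given field."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        group = row[key]
        total[group] += 1
        solved[group] += int(row["solved"])
    return {group: solved[group] / total[group] for group in total}


by_model = solve_rates(records, "model")
by_category = solve_rates(records, "category")

for name, rate in sorted(by_model.items()):
    print(f"{name}: {rate:.0%}")
```

Swapping `records` for rows loaded from the real files (with the real column names) should give the overall and per-category breakdowns.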