The dataset was cleaned before analysis: 99% of Informational-severity findings and ~40% of Low-severity were removed, as they consistently lacked sufficient detail to be informative.
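The cleaning step can be sketched roughly as follows. This is a toy illustration, not the actual pipeline: the table, the keep-fractions, and the use of random sampling (as a stand-in for the real detail-based criterion) are all assumptions.

```python
import pandas as pd

# Toy findings table standing in for the raw dataset (real data not reproduced)
raw = pd.DataFrame({
    "severity": ["Informational"] * 100 + ["Low"] * 10 + ["High"] * 5,
    "detail": ["..."] * 115,
})

# Fraction of each severity class to keep; random sampling here is only a
# stand-in for the actual "lacked sufficient detail" criterion
KEEP_FRACTION = {"Informational": 0.01, "Low": 0.60}

cleaned = (
    raw.groupby("severity", group_keys=False)
       .apply(lambda g: g.sample(frac=KEEP_FRACTION.get(g.name, 1.0), random_state=0))
)
print(cleaned["severity"].value_counts())
```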
The goal was to quantify report quality — not just flag vulnerabilities, but measure how well each one is documented. This became the foundation for a RAG-based audit assistant I've been building, where data quality has an outsized effect on output quality.
Scoring methodology:
Each finding was scored on three primary dimensions — description depth, remediation quality, and presence of a proof of concept (PoC). The PoC carried the highest weight, as it is the most reliable signal of a useful report. Solidity snippets and severity level contributed additional points. Raw scores (0–15) were log-normalized to a 0–1 range to avoid scores concentrating at the top of the scale.
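One plausible reading of that scoring scheme, in code. The specific weights, sub-scales, and the `log1p` normalization below are my assumptions; the source only fixes the 0–15 raw range, the 0–1 output, and that the PoC dominates.

```python
import math

# Hypothetical weights -- the actual values used in the analysis are not published
WEIGHTS = {
    "has_poc": 6,        # PoC carries the highest weight
    "has_snippet": 2,    # Solidity snippet present
    "severity_bonus": 1, # extra point for higher severity
}
MAX_RAW = 3 + 3 + sum(WEIGHTS.values())  # description (0-3) + remediation (0-3) = 15

def score_finding(description_depth, remediation_quality, has_poc, has_snippet, severity_bonus):
    """Raw 0-15 score, log-normalized to 0-1."""
    raw = (
        description_depth                       # 0-3
        + remediation_quality                   # 0-3
        + WEIGHTS["has_poc"] * bool(has_poc)
        + WEIGHTS["has_snippet"] * bool(has_snippet)
        + WEIGHTS["severity_bonus"] * bool(severity_bonus)
    )
    # log1p spreads out the crowded low-to-mid raw scores and maps 15 -> 1.0
    return math.log1p(raw) / math.log1p(MAX_RAW)
```

A full report (depth 3, remediation 3, PoC, snippet, severity bonus) maps to 1.0, and an empty one to 0.0.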
Key findings:
— Total findings analyzed: 23,625
— Mean score: 0.32 | Median: 0.27
— The distribution is multimodal, with three distinct quality tiers (peaks near 0.05, 0.25, and 0.60)
— ~25% of findings score above 0.51, forming the high-quality tier (the "golden" subset)
— All three normality tests confirm the distribution is significantly non-Gaussian
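The non-Gaussianity check can be reproduced on a synthetic stand-in. Which three tests were actually run isn't stated above; Shapiro-Wilk, D'Agostino-Pearson, and Anderson-Darling are a common trio, so that's what this sketch assumes. The mixture proportions are also illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic trimodal stand-in for the score distribution: three tiers at
# roughly the observed peaks (real scores are not reproduced here)
scores = np.concatenate([
    rng.normal(0.05, 0.03, 8000),
    rng.normal(0.25, 0.06, 10000),
    rng.normal(0.60, 0.08, 5625),
]).clip(0, 1)
rng.shuffle(scores)

# Three standard normality tests; a trimodal sample should fail all of them
sw = stats.shapiro(scores[:5000])         # Shapiro-Wilk (subsample; test is sized for small n)
dk = stats.normaltest(scores)             # D'Agostino-Pearson omnibus test
ad = stats.anderson(scores, dist="norm")  # Anderson-Darling

print(f"Shapiro-Wilk p={sw.pvalue:.2e}, D'Agostino p={dk.pvalue:.2e}")
print(f"Anderson-Darling stat={ad.statistic:.1f}, 1% critical={ad.critical_values[-1]}")
```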
Most counterintuitive result: Critical-severity bugs score lower on average (0.33) than High-severity ones (0.53). Critical findings tend to be reported as brief alerts without PoC — the severity speaks for itself, so the write-up gets less attention. High findings, by contrast, typically include more thorough documentation. This is a problem: the bugs most likely to cause catastrophic losses are often the least well-documented.
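The Critical-vs-High comparison is just a per-severity mean; a minimal sketch with toy rows (the real per-finding scores are not reproduced, and the numbers below are chosen only to echo the reported averages):

```python
import pandas as pd

# Toy rows standing in for the real dataset
df = pd.DataFrame({
    "severity": ["Critical", "Critical", "High", "High", "High"],
    "score":    [0.30, 0.36, 0.50, 0.55, 0.54],
})

# Mean quality score per severity level -- the comparison behind the
# Critical-vs-High result
by_severity = df.groupby("severity")["score"].mean().round(2)
print(by_severity)
```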
What this means in practice:
The three-peak distribution reflects real behavioral patterns in how auditors write reports. The first cluster (scores ~0.05) represents minimal one-liner findings with no context. The second (~0.25) covers standard reports with a description but no PoC. The third (~0.60) is the minority that includes everything: a clear description, remediation steps, and working exploit code. Only this last group is genuinely useful for both AI training and human review.
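The tiering above can be operationalized with cut points placed between the peaks. The thresholds below (0.15 and 0.45) are my own illustrative choices, not values from the analysis:

```python
def quality_tier(score: float) -> str:
    """Assign a finding to one of the three observed quality tiers.

    Cut points are illustrative, placed between the ~0.05, ~0.25,
    and ~0.60 peaks of the score distribution.
    """
    if score < 0.15:
        return "minimal"   # one-liner findings, no context
    if score < 0.45:
        return "standard"  # description present, no PoC
    return "complete"      # description + remediation + working PoC
```

For example, scores at the three peaks map to the three tiers: `quality_tier(0.05)` gives `"minimal"`, `quality_tier(0.25)` gives `"standard"`, and `quality_tier(0.60)` gives `"complete"`.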
For the broader ecosystem, the takeaway is uncomfortable: the current standard of audit reporting leaves most findings underexplained. A well-documented bug with a PoC can be understood, reproduced, and fixed in hours. A vague one-liner can stay misunderstood for weeks — or get silently ignored in the next audit cycle.
If you want to see the full distribution charts and statistics for yourself, I put together an interactive notebook with all the visualizations:
https://colab.research.google.com/drive/1Wp4yyEmXYjHATak7Bmy...
Open to questions on methodology or dataset composition.