The #1 spot in the ranking is both more of a deal and less of a deal than it might appear. It's less of a deal in that HackerOne is an economic numbers game. There are countless programs you can sign up for, with varied difficulty levels and payouts. Most of them don't pay a whole lot and don't attract top talent in the industry. Instead, they offer supplemental income to infosec-minded school-age kids in the developing world. So I wouldn't read this as "Xbow is the best bug hunter in the US". That's a bit of a marketing gimmick.
But this is also not a particularly meaningful objective. The problem is that there's a lot of low-hanging bugs that need squashing and it's hard to allocate sufficient resources to that. Top infosec talent doesn't want to do it (and there's not enough of it). Consulting companies can do it, but they inevitably end up stretching themselves too thin, so the coverage ends up being hit-and-miss. There's a huge market for tools that can find easy bugs cheaply and without too many false positives.
I personally don't doubt that LLMs and related techniques are well-tailored for this task, completely independent of whether they can outperform leading experts. But there are skeptics, so I think this is an important real-world result.
Below, people are reading the tea leaves to get any clue they can.
AI can be remarkably effective at discovering a broad range of vulnerabilities—but the real challenge isn't always detection; it's precision. Automation has long struggled with false positives, and nowhere is this more evident than in vulnerability scanning. Tools that flag dozens of irrelevant issues often create more work than they save. When AI enters the equation, the stakes grow even higher: models can generalize well, but verifying technical edge cases is a different game entirely.
To ensure accuracy, we developed the concept of validators, automated peer reviewers that confirm each vulnerability XBOW uncovers. Sometimes this process leverages a large language model; in other cases, we build custom programmatic checks. For example, to validate Cross-Site Scripting findings, a headless browser visits the target site to verify that the JavaScript payload was truly executed. (Don't miss Brendan Dolan-Gavitt's Black Hat presentation on AI agents for offsec.)
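For a sense of what such a validator could look like in practice, here is a minimal sketch (not XBOW's actual code; Playwright, the marker string, and the function name are assumptions for illustration): a headless browser loads the candidate URL and only confirms the XSS if the injected payload observably executes, rather than merely being reflected in the response.

```python
# Hypothetical XSS validator sketch: confirm a payload actually executed,
# not just that it was reflected. Assumes the injected payload calls
# alert("xbow-validate-1337") when it runs.
from playwright.sync_api import sync_playwright

MARKER = "xbow-validate-1337"  # illustrative marker carried by the payload

def xss_payload_executed(url: str) -> bool:
    executed = False
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def on_dialog(dialog):
            nonlocal executed
            # Seeing the marker in an alert() dialog proves the script ran
            # in the browser context, not just that it appeared in the HTML.
            if MARKER in dialog.message:
                executed = True
            dialog.dismiss()

        page.on("dialog", on_dialog)
        page.goto(url, wait_until="networkidle")
        page.wait_for_timeout(2000)  # allow delayed / DOM-based payloads to fire
        browser.close()
    return executed
```

A programmatic check of this kind would let a pipeline discard payloads that are reflected but never execute (blocked by CSP, encoded, or landing in a non-executing sink), which is the class of false positive the post says drowns traditional scanners.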
The glaring omission here is a discussion of how many bugs the XBOW team had to manually review in order to make ~1k "valid" submissions. They state:
> It was a unique privilege to wake up each morning and review creative new exploits.
How much of every morning was spent reviewing exploits? And what % of them turned out to be real bugs? These are the critical questions that (a) are unanswered by this post, and (b) determine the success of any product in this space imo.
What is the top talent spending its time on?
To pay bills: often working for tier A tech companies on intellectually-stimulating projects, such as novel mitigations, proprietary automation, etc. Or doing lucrative consulting / freelance work. Generally not triaging Nessus results 9-to-5.
I don't think LLMs replace humans; they do free up time for nicer tasks.
Succinct description of HN. It’s a damn shame.
> To bridge that gap, we started dogfooding XBOW in public and private bug bounty programs hosted on HackerOne. We treated it like any external researcher would: no shortcuts, no internal knowledge—just XBOW, running on its own.
Is it dogfooding if you're not doing it to yourself? I'd consider it dogfooding only if they were flooding themselves with AI-generated bug reports, not other people. They're not the ones reviewing them.
Also, honest question: what does "best" mean here? The one that has sent the most reports?
22/24 (Valid / Closed) for Walt Disney
3/43 (Valid / Closed) for AT&T
The market for bounties is a circus, breadcrumbs for free work from people trying to 'make it'. It can safely be analogized to the classic trope of those wanting to work in games getting paid fractional market rates for absurd amounts of QA effort. The number of CVSS vulns with a score above 8 that have floated across the front page of HN in the past year without anyone getting paid tells you that much.
You make it sound like there are tons of people going around who can just dig up CVSS vulns above 8, which is making me all confused. Is that really happening? I have a single bounty on H1 just to show I could do it, and even that took ages and was a shitty bug.
Some of that is likely down to company policies; Snapchat's policy, for example, is that nothing is ever marked invalid.
Seems reasonable to call that dogfooding considering that flooding themselves wouldn't be any more useful than synthetic testing and there's only so much ground they could cover using it on their own software.
If this were coming out of Microsoft or IBM or whatever then yeah, not really dogfooding.
- Design the system and prompts
- Build and integrate the attack tools
- Guide the decision logic and analysis
This isn’t just semantics — overstating AI capabilities can confuse the public and mislead buyers, especially in high-stakes security contexts.
I say this as someone actively working in this space. I participated in the development of PentestGPT, which helped kickstart this wave of research and investment, and more recently, I’ve been working on Cybersecurity AI (CAI) — the leading open-source project for building autonomous agents for security:
- CAI GitHub: https://github.com/aliasrobotics/cai
- Tech report: https://arxiv.org/pdf/2504.06017
I’m all for pushing boundaries, but let’s keep the messaging grounded in reality. The future of AI in security is exciting — and we’re just getting started.
Who would it be, gremlins? Those humans weren't at the top of the leaderboard before they had the AI, so clearly it helps.
The future is definitely a combination of humans and bots, like anything else; it won't replace humans, just like coding bots won't replace devs. In fact this will allow humans to focus on the fun/creative hacking instead of the basic/boring tests.
What I am worried about is on the triage/reproduction side, right now it is still mostly manual and it is a hard problem to automate.
https://hackerone.com/xbow?type=user
Which shows a different picture. This may not invalidate their claim (best US), but a screenshot can be a bit cherry-picked.
While niche and not widely used, there are at least thousands of publicly available servers for each of these projects.
I genuinely think this is one of the biggest near-term issues with AI. Even if we get great AI "defence" tooling, there are just so many servers and (IoT or otherwise) devices out there, most of which are not trivial to patch. While a few niche services getting pwned probably isn't a big deal, a million niche services all getting pwned in quick succession is likely to cause huge disruption. There is so much code out there that hasn't been remotely security checked.
Maybe the end solution is some sort of LLM-based "WAF", deployed by ISPs, that inspects all traffic.
There is also a BIG hurdle between crashing something (which generally will be detected), versus RCE which requires a lot more work.
That seems a bit unethical. I thought companies specifically deny usage of automated tools. A bit too late, eh…?
The bug works and is reproducible - as a software owner, do you care whether a human or an AI found it?
The tooling and models are maturing quickly and there is definitely some value in autonomous security agents, both offensive and defensive - but it also still requires a lot of work, knowledge (my group is all ML people), skill, and planning if you want to approach anything more than bug bashing.
This recent paper from Dreadnode discusses a benchmark for this sort of challenge: https://arxiv.org/abs/2506.14682
Yikes, explains why my manually submitted single vulnerability is taking weeks to triage.
I see what you're saying but I think a more charitable interpretation can be made. They may be amazed that so many bug reports are being generated by such a reputable group. Looking at your initial reply, perhaps a more constructive comment could be one that joins them in excitement (even if that assumption is erroneous) and expanding on why you think it is exciting (e.g. this group's reputation for quality).
I took the opposite reading - that they were no longer shocked it was taking so long once they found out why, since they knew who the submitter was and understood.
>130 resolved
>303 were classified as Triaged
>33 reports marked as new
>125 remain pending
>208 were marked as duplicates
>209 as informative
>36 not applicable
Even 20% ties up a lot of resources if you have a high volume of submissions, and those numbers will rise.
Basically if you are new, the reviewer thinks "oh, a rando" and in his mind he has already downgraded the severity a bit.
It's unfortunately a kind of cartel at this point. Not full-fledged and out in the open, but a low-key cartel. They have a circle of friends whose CSRF would also get a better valuation. It's a sorry state.
The problem is that the people who know how to use AI properly will be slower and more careful in their submissions.
Many others won't, so we'll get lots of noise hiding the real issues. AI makes it easy to produce many bad results in a short time.
They are faster than the purely manual ones, but can't beat the AI-created bad ones in either speed or numbers.
It’s like the IT security version of the Gish gallop.
But there's a claim that it is unsupervised, which I doubt. See how these two claims contradict each other.
>"XBOW is a fully autonomous AI-driven penetration tester. It requires no human input, "
>"To ensure accuracy, we developed the concept of validators, automated peer reviewers that confirm each vulnerability XBOW uncovers. Sometimes this process leverages a large language model; in other cases, we build custom programmatic checks."
I mean, I doubt you deploy this thing to collect thousands of dollars in bounties and just sit there twiddling your thumbs. Whatever work you put into the AI, whether fine-tuned or generic and reusable, counts as supervised, and that's OK. Take the win; don't try to sell the automated dream to get investors or whatever, and don't get caught up in fraud.
As I understand it, when you discover a type of vulnerability, it's very common to automate its detection and find other clients with the same vulnerability; these windows are usually short-lived and the well dries up fast, so you need to constantly stay on top of the latest trends. I just don't buy that if you left this thing unattended for even 3 months it would keep finding gold; that's a property of the engineers, which is not scalable (and that's OK).
The thing about bug bounties, the only way to win is to not play the game.
I think within the next 5 years or so, we are going to see a societal pattern repeating: any program that rewards human ingenuity and input will become industrialized by AI to the point where it becomes a cottage industry of companies flooding every program with 99% AI submissions. What used to be lone wolves or small groups of humans working on bounties will become truckloads of AI generated “stuff” trying to maximize revenue.
> What used to be lone wolves or small groups of humans working on bounties will become truckloads of AI generated “stuff” trying to maximize revenue.
You're objecting to the wrong thing. The purpose of a bug bounty programme is not to provide a cottage industry for security artisans - it's to flush out security vulnerabilities.
There are reasonable objections to AI automation in this space, but this is not one of them.
I had one critical bug take 3 years to get a payout. I had a full walkthrough with videos and a report. The company kept stalling and at one point told me that because they had the app completely remade, they weren't going to pay me anything.
Hackerone doesn't really protect the researcher either. I was told multiple times that there was 'nothing they could do'.
I eventually got paid, but this is pretty normal behavior with regards to bug bounty. Too many companies use it for free security work.
Most companies should not do bug bounties.
but FOSS is FOSS, I guess; source-available doesn't mean we have to read your messages - see SQLite (won't even take PRs lol)
There is a reason companies like HackerOne exist - it's because dealing with the submissions is terrible.
If it's not reliable, how can you rely on the written issue to be correct, or the review, and so how does that benefit you over just blindly merging whatever changes are created by the model?
It’s not real.
But you can bet someone will sell that as the solution.
OK.. but "getting more bugs fixed" isn't any kind of objective success metric for, well, anything, right?
It's fine if you want to use it as a KPI for your specific thing! But it's not like it's some global KPI for everyone?
When we put our product on there, roughly 2019, the enterprising hackers ran their scanners, submitted everything they found as the highest possible severity to attempt to maximize their payout, and moved on. We wasted time triaging all the stuff they submitted that was nonsense, got nothing valuable out of the engagement, and dropped HackerOne at the end of the contract.
You'd be much better off contracting a competent engineering security firm to inspect your codebase and infrastructure.
Some of my favorites from what we've released so far:
- Exploitation of an n-day RCE in Jenkins, where the agent managed to figure out the challenge environment was broken and used the RCE exploit to debug the server environment and work around the problem to solve the challenge: https://xbow.com/#debugging--testing--and-refining-a-jenkins...
- Authentication bypass in Scoold that allowed reading the server config (including API keys) and arbitrary file read: https://xbow.com/blog/xbow-scoold-vuln/
- The first post about our HackerOne findings, an XSS in Palo Alto Networks GlobalProtect VPN portal used by a bunch of companies: https://xbow.com/blog/xbow-globalprotect-xss/
Like any "AI" article, this is an ad.
If you are willing to tolerate a high false positive rate, you can as well use Rational Purify or various analyzers.
https://www.blackhat.com/us-25/briefings/schedule/#ai-agents...
I know you've been on HN for a while, and that you're doing interesting stuff; HN just has a really intense immune system against vendor-y stuff.
I'll see if I can get time to do a paper to accompany the BH talk. And hopefully the agent traces of individual vulns will also help.
> White Paper/Slide Deck/Supporting Materials (optional)
> • If you have a completed white paper or draft, slide deck, or other supporting materials, you can optionally provide a link for review by the board.
> • Please note: Submission must be self-contained for evaluation, supporting materials are optional.
> • PDF or online viewable links are preferred, where no authentication/log-in is required.
(From the link on the BHUSA CFP page, which confusingly goes to the BH Asia doc: https://i.blackhat.com/Asia-25/BlackHat-Asia-2025-CFP-Prepar... )
I remember your work on seeding vulnerabilities into C programs. I didn't know you got into AI-assisted pentesting. I already have more confidence in the product. :)
Another great read is [1] (2024).
[1] "LLM and Bug Finding: Insights from a $2M Winning Team in the White House's AIxCC": https://news.ycombinator.com/item?id=41269791