> Within 2-3 minutes it had found a 'promising' exploit, that initially failed because of some naïve filtering in the app. Another 2 minutes later it figured an encoding mechanism that bypassed the filtering the app did and it had found a complete RCE, and written a full proof of concept.
Yeah, the feedback loop brings it from "omg did you hear that curl closed their bounty program because of slop" to "cve or gtfo". I have no doubt there are many teams doing this at scale, even with less capable models (local OSS). If the model has a feedback loop and an easily testable success criterion, this becomes a pass@n problem, and it scales with "just money".
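To put rough numbers on the pass@n point, here is a minimal sketch. It assumes every attempt is independent and succeeds with the same probability `p`, which is an idealization (real attempts against one target are likely correlated). The function names and the 1% success rate are made up for illustration, not measured from anything.

```python
import math


def pass_at_n(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds.

    Assumption: each attempt succeeds with the same probability p.
    """
    if not 0.0 <= p <= 1.0 or n < 0:
        raise ValueError("need 0 <= p <= 1 and n >= 0")
    return 1.0 - (1.0 - p) ** n


def attempts_needed(p: float, target: float) -> int:
    """Smallest n such that pass_at_n(p, n) >= target, same assumptions."""
    if not 0.0 < p < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("need 0 < p < 1 and 0 < target < 1")
    # Solve 1 - (1 - p)**n >= target for n, then round up.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# Hypothetical: a weak model that lands a working exploit 1% of the time.
n = attempts_needed(0.01, 0.95)
print(n, round(pass_at_n(0.01, n), 4))
```

Under those assumptions the number of attempts grows roughly like 1/p for a fixed target, so a weaker model mostly means a bigger bill, as long as the success check is cheap and reliable.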