The title implies some novel research, or a review of existing research, that clearly shows agents are better at code review than humans, but the paper then provides only this single paragraph on the review capabilities of agents:
> Beyond general software engineering, several strands of work speak specifically to the capabilities that code review requires. Pornprasit and Tantithamthavorn evaluate LLM-based automated review in industrial settings and find that agents detect the same categories of defect that human reviewers target: correctness errors, security weaknesses, performance inefficiencies, and style violations [12]. Li et al. demonstrate that CodeReviewer produces actionable inline comments at quality that is at least comparable to those of trained human reviewers on a significant fraction of the evaluation set [11].
- I disagree with the contributions
- I suspect most of the paper was written / edited by "generative AI"
It is a shame that "researchers" have started using generative AI to such a large degree as it now masks the voice of the person. Generative AI tends to claim things that are not true and tends to use words that are unsuitable. This text leaves a bad taste in my mouth.
(Edits: formatting)
- The paper is garbage, and a human review process would reject it.
__As is the case in this discussion__. The sentiment of the discussion section (at the time of writing) seems to be in favor of rejecting this paper. (Of course, to make it a proper experiment, one would also need to give the paper to a "generative AI" reviewer to see whether it would reject it, but I can't be bothered.)
Who said it has to "scale with AI-assisted throughput"? AI can produce code all day. The goal is not to fill storage with AI code, it is to make products, following product tradeoffs, timelines, and decisions.
But yeah - I can have one LLM check another LLM's work. Kind of a waste of tokens for most PRs.
Not sure I can agree with this premise, especially since there seems to be a complete lack of "real-world results" in this evaluation. This strikes me as being written by a theorist whose only experience with Quality Assurance exists in studies or papers.
Yes, you can set up guardrails and validations and other things, but as long as its primary brain is demonstrably full of holes, I feel an obligation to, at the very least, always be capable of taking over the code. To me this means it should stay human-maintainable, and it's for humans to decide what is and isn't, one review at a time.
"Can't scale due to too many PRs" neglects answering questions like: Are these PRs valuable? Are they just additional PRs to right the wrongs of previous ill-conceived PRs? How much churn is going on here? Is the influx of PRs a permanent state, or something that we'll only live through temporarily because we have a lot of little things we can set our agents upon, but after they're done we'll return to a normal work cadence?
I think the only real solution is to add increasingly strict guardrails that can be enforced with a combination of more AI agents and actual executable contracts. The other aspect is using languages and tools that densify correctness, i.e. languages like Rust with a very rich type system, so that both review and design can focus on a slice that is small by volume: the core types. The other main tools for densifying correctness are formal methods (model checking, etc.), fuzzing/property-based testing, and static analysis.
All of these tools are cheaper to use than they once were, because a lot of the minutiae can be handled by AI agents while core invariants can receive heavy human scrutiny.
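To make the "executable contract plus property-based testing" idea concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: `NonEmptySortedList`, `insert_sorted`, and `check_property` are hypothetical names, and the hand-rolled random loop stands in for a real property-testing library. The point is only that the invariant lives in one small core type a human can review closely, while the surrounding code (possibly agent-written) is checked mechanically.

```python
import bisect
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NonEmptySortedList:
    """Hypothetical core type: the invariant is enforced in one place."""
    items: tuple

    def __post_init__(self):
        # Executable contract: refuse to construct a value that breaks the invariant.
        if not self.items:
            raise ValueError("must not be empty")
        if any(a > b for a, b in zip(self.items, self.items[1:])):
            raise ValueError("must be sorted")


def insert_sorted(lst: NonEmptySortedList, x: int) -> NonEmptySortedList:
    """Stand-in for agent-written code; the constructor re-checks the result."""
    idx = bisect.bisect_left(lst.items, x)
    return NonEmptySortedList(lst.items[:idx] + (x,) + lst.items[idx:])


def check_property(trials: int = 500, seed: int = 0) -> int:
    """Hand-rolled property test: random inputs, one property asserted on each."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(1, 20)
        base = tuple(sorted(rng.randint(-100, 100) for _ in range(size)))
        x = rng.randint(-100, 100)
        result = insert_sorted(NonEmptySortedList(base), x)
        # Property: inserting x yields exactly the sorted multiset of base plus x.
        assert list(result.items) == sorted(base + (x,))
    return trials


if __name__ == "__main__":
    print(f"property held for {check_property()} random cases")
```

In this sketch the part that needs heavy human scrutiny is the dozen lines of `NonEmptySortedList`; `insert_sorted` could be rewritten freely and would still be caught by the constructor or the property check if it broke the invariant.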
IMO generative AI is here to stay in development, so we may as well get ahead of the game and start using these tools to try to get the best out of it.
That's a pretty bike take, but I think a bit less than what this paper is saying.
A milder sub-argument I really enjoyed recently was just that we basically have an obligation to have AI code review:
> Nuclear opinion: I don’t care if you use AI to write code,¹ but making a human review a change before an LLM is disrespectful.
> Why am I catching your bugs if a computer could have done it? It’s like asking to review code that doesn’t compile or pass CI.
https://bsky.app/profile/filippo.abyssdomain.expert/post/3mo...
This also makes me really appreciate a thread that showed up recently, where folks were talking about their code/repo forges. Using tickets/PRs as the rendezvous point where people and agents come together had very high appeal to me. Being able to surface discussion in a shared context seems crucial.
https://btao.org/posts/2026-05-09-the-forge-we-deserve/ https://news.ycombinator.com/item?id=48582147
Have you seen the decisions LLMs make? They write code like the worst developers I know. They're lazy, short-sighted, and impossible to teach.