The truth is we’re all still experimenting, and shovels of every shape and size are being built.
What I keep running into is the step before reading the tests or the code: when a change is large or mechanical, I’m mostly trying to answer "did behavior or the API actually change, or is this mostly reshaping?" so I know how deep to go.
Agree we’re all still experimenting here.
I ran into this problem while reviewing AI-generated refactors and started wondering whether we’re still reviewing the right things. Mostly curious how others approach this.
It's common to swap git's diff out for something like difftastic, so formatting slop doesn't trigger spurious diff lines.
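If memory serves, the difftastic docs suggest wiring it in roughly like this (double-check against their README):

    # one-off
    GIT_EXTERNAL_DIFF=difft git diff

    # or permanently, in ~/.gitconfig
    [diff]
        external = difft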
FWIW, you're probably better off just avoiding LLMs. LLMs cannot produce working code, and they're the wrong tool for this: they're just predicting tokens around other tokens; they don't ascribe meaning to them, only statistical likelihood.
The LLM weights themselves would be far more useful if we used them to score the statistical likelihood (i.e., perplexity) of code that has already been written: strange-looking code is more likely to be buggy. Nobody has written that tool yet, though.
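The core scoring step would be something roughly like this, a sketch using Hugging Face transformers with gpt2 purely as a stand-in; the function name, model choice, and top-k cutoff are all just illustrative, and a code-tuned model plus real thresholds would be needed for anything serious:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def most_surprising_tokens(code, model_name="gpt2", top_k=10):
        # Score each token of `code` by how unlikely the model found it.
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        model.eval()

        ids = tok(code, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits

        # Log-probability the model assigned to each actual next token.
        log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
        targets = ids[:, 1:]
        token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1).squeeze(0)

        # High surprisal (negative log-prob) is the hypothetical
        # "this looks odd, look closer" signal.
        surprisal = (-token_lp).tolist()
        tokens = tok.convert_ids_to_tokens(targets.squeeze(0).tolist())
        return sorted(zip(tokens, surprisal), key=lambda t: -t[1])[:top_k]

    if __name__ == "__main__":
        print(most_surprising_tokens("def add(a, b):\n    return a - b\n"))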
My question is slightly orthogonal, though: even with a cleaner diff, I still find it hard to quickly tell whether the public API or behavior changed, or whether logic just moved around.
Not really about LLMs as reviewers; more about whether there are useful deterministic signals above the line-level diff.
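To make "deterministic signal" concrete, here's a toy, Python-only sketch: diff the top-level public signatures of a file before and after a change, ignoring how the bodies got reshuffled. The names are just illustrative, and it says nothing about behavior changes inside the bodies, so it's triage rather than proof.

    import ast

    def public_surface(source):
        # Collect a crude "public API surface": top-level defs/classes whose
        # names don't start with an underscore, plus their argument names.
        surface = set()
        for node in ast.parse(source).body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                if not node.name.startswith("_"):
                    args = ", ".join(a.arg for a in node.args.args)
                    surface.add(f"def {node.name}({args})")
            elif isinstance(node, ast.ClassDef) and not node.name.startswith("_"):
                methods = sorted(n.name for n in node.body
                                 if isinstance(n, ast.FunctionDef)
                                 and not n.name.startswith("_"))
                surface.add(f"class {node.name}: {methods}")
        return surface

    def api_delta(before, after):
        old, new = public_surface(before), public_surface(after)
        # (removed, added); both empty means the surface is unchanged.
        return old - new, new - old

    if __name__ == "__main__":
        before = "def greet(name):\n    return 'hi ' + name\n"
        after = "def greet(name, greeting='hi'):\n    return greeting + ' ' + name\n"
        print(api_delta(before, after))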