I have been digging into this for some time, and the main reason seems to lie in the definition of what a good review is. It's less about finding bugs and more about code evolvability, maintainability, and simplicity. Without context that hints at how the app will evolve over time, how it is maintained, and what is simple or hard for a real person, LLMs struggle to deliver useful insights here.
Here you can find some related papers: https://github.com/sermakarevich/ai_knowledge_wiki/blob/mast...