One thing that surprised me is how much code citation data is already in most of the models' training data. Where the agents still fall apart is visual analysis: give them a corroded valve photo with a vague description and they'll confidently cite the wrong API standard. That gap is most of where the 87% delta comes from for us.
Happy to walk through specific cases if anyone wants to dig in.