1 point by Aeroi 2 hours ago | 2 comments

nigardev an hour ago
visual analysis is the right bottleneck to call out. most coding agents can read and write code fine because it's just text. but identifying a corroded valve from a photo and suggesting the right fix? that's a different problem entirely. curious how your benchmark scores the gap between text-reasoning and visual-reasoning tasks
- Aeroi an hour ago
  [dead]
Aeroi 2 hours ago
One thing that surprised me is how much code citation data is already in most models' training data. Where the agents still fall apart is visual analysis: give them a corroded valve photo with a vague description and they'll confidently cite the wrong API standard. That gap is most of where the 87% delta comes from for us.
Happy to walk through specific cases if anyone wants to dig in.