I think the big thing I was trying to highlight in this article was the fact that not much effort has been put into spatial and image awareness. In my limited experiments, when I manually ask the models to take an image and highlight things (like "circle all elbows"), they do a great job... but if you ask a model where an elbow is in the image (in pixels), it does a poor job.
Or maybe put another way, going from `image->model->tool` seems to be an area for improvement.
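
For anyone who wants to try the "where is the elbow, in pixels" test themselves, here is a rough sketch of how the answer could be scored. It assumes the model is asked to reply with JSON like `{"x": ..., "y": ...}` and that you have a hand-labelled point to compare against. The reply string, the ground-truth point, the image size, and the 5% tolerance are all made up for illustration, and the function names are hypothetical.

```python
import json
import math


def parse_point(reply: str) -> tuple[float, float]:
    """Parse a model reply assumed to be JSON of the form {"x": 412, "y": 230}."""
    data = json.loads(reply)
    return float(data["x"]), float(data["y"])


def pixel_error(predicted: tuple[float, float], truth: tuple[float, float]) -> float:
    """Euclidean distance in pixels between the predicted and labelled points."""
    return math.hypot(predicted[0] - truth[0], predicted[1] - truth[1])


def is_hit(predicted, truth, image_size, tolerance=0.05) -> bool:
    """Count a prediction as a hit if its error is within `tolerance`
    (an arbitrary fraction chosen here) of the image diagonal."""
    diagonal = math.hypot(image_size[0], image_size[1])
    return pixel_error(predicted, truth) <= tolerance * diagonal


if __name__ == "__main__":
    reply = '{"x": 412, "y": 230}'  # made-up model output, not a real result
    truth = (400, 225)              # made-up hand label
    image_size = (640, 480)         # made-up width, height in pixels

    point = parse_point(reply)
    print("error in pixels:", pixel_error(point, truth))
    print("hit:", is_hit(point, truth, image_size))
```

Scoring against the image diagonal rather than a fixed pixel count keeps the threshold comparable across image sizes; that choice is just a convenience, not an established metric.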