I've been learning robotics by training a vision-language-action model to track faces based on text commands and deploying it to a Raspberry Pi with a pan/tilt camera mount. This article covers one of the issues I ran into while creating a useful simulation environment.