Regarding the verifier that plays against the live engine, I’ve approached the problem from a similar angle: my LLM agents borrow a page from the speedrunning community's tool-assisted speedruns (TAS), so the LLM only ever gets access to a virtualized game controller.
[1] - https://store.steampowered.com/app/346850/Chips_Challenge_1
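As a rough illustration of the controller-only idea (a hypothetical sketch, not the system described above): the agent can only press buttons on a virtual pad, every frame's input is logged TAS-style, and the log can then be replayed against the game. The names `VirtualController`, `Button`, and `replay`, and the toy grid "engine", are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Button(Enum):
    UP = "up"
    DOWN = "down"
    LEFT = "left"
    RIGHT = "right"


@dataclass
class VirtualController:
    """Hypothetical pad: the agent's only interface to the game."""
    movie: list = field(default_factory=list)  # one entry per frame, TAS-style

    def press(self, *buttons: Button) -> None:
        # Record the set of buttons held on this frame.
        self.movie.append(frozenset(buttons))


def replay(movie, start=(0, 0)):
    """Replay an input log against a toy grid 'engine'; return the final position."""
    deltas = {
        Button.UP: (0, -1),
        Button.DOWN: (0, 1),
        Button.LEFT: (-1, 0),
        Button.RIGHT: (1, 0),
    }
    x, y = start
    for frame in movie:
        for button in frame:
            dx, dy = deltas[button]
            x, y = x + dx, y + dy
    return (x, y)


# The agent emits inputs frame by frame; it never touches game state directly.
pad = VirtualController()
pad.press(Button.RIGHT)
pad.press(Button.RIGHT)
pad.press(Button.DOWN)
final_position = replay(pad.movie)
```

Because the agent's whole output is the input log, a verifier can judge a run by replaying that log on the real engine instead of trusting the agent's own account of what happened.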
Curious about your agent setup though. Any public repo?
I don't have a GH repo up for the TAS system yet - it's a bespoke mess right now since it was built with the old game "Castle of the Winds" in mind, but I'll definitely consider publishing it in the future!