Interesting. How do you handle the observability side during training? One thing I ran into with multi-agent RL is that reward signals alone don't tell you much about why an agent is failing. Curious if you've built any tooling around that.
Browser agents are the use case where RL makes the most sense: the reward signal is obvious (did the task get done or not) and the action space is bounded. Curious how you handle the credit assignment problem across multi-step navigation, though.
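A minimal sketch of the credit-assignment problem being raised here: with a sparse end-of-episode reward (task succeeded or not), discounted returns are one baseline way to spread that single signal back across every navigation step. The function and episode below are illustrative, not taken from either poster's setup.

```python
# Hypothetical sketch: a 4-step browser navigation episode where only
# the final action (submitting the form) carries reward. Discounted
# returns assign exponentially decaying credit to earlier actions.

def discounted_returns(rewards, gamma=0.99):
    """Compute the return G_t at each step from per-step rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step accumulates the discounted future reward.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# click, type, scroll, submit -- sparse terminal reward of 1.0
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards))
```

This is the crudest answer to the question; the earlier an action sits in the trajectory, the weaker its credit, which is exactly why long-horizon navigation tasks usually need something sharper (value baselines, per-step shaping, or process rewards).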