Nice writeup. If I am reading it right, the core bug was that two independent paths each generated the per-invocation namespace, and the fix was to generate it once and propagate it to the workers. That reads more like an implementation slip than a design failure. The multi-level naming scheme is sound; the namespace just has to be computed in exactly one place.
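To check my understanding, here is a minimal sketch of how I picture the fixed shape. Everything in it is my own guess rather than your code: the function names, the `testrun` prefix, the `namespace/worker-N/resource` layout, and the thread pool are all hypothetical stand-ins.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


def make_invocation_namespace(prefix: str = "testrun") -> str:
    """Generate the per-invocation namespace (hypothetical format).

    This is the only function that creates a namespace; it should be
    called exactly once per invocation.
    """
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


def worker_resource_name(namespace: str, worker_id: int, resource: str) -> str:
    """Derive a multi-level name from the shared namespace (assumed layout)."""
    return f"{namespace}/worker-{worker_id}/{resource}"


def run_worker(namespace: str, worker_id: int) -> str:
    # The worker receives the namespace as an argument and never
    # generates its own, so all workers agree on the top level.
    return worker_resource_name(namespace, worker_id, "scratch")


def run_invocation(num_workers: int = 4) -> list[str]:
    namespace = make_invocation_namespace()  # single point of generation
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves input order, so result i belongs to worker i.
        return list(pool.map(lambda i: run_worker(namespace, i), range(num_workers)))


if __name__ == "__main__":
    for name in run_invocation(3):
        print(name)
```

The point of the shape is that workers take the namespace as a parameter, so a second code path has no way to mint a different one by accident.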
The part I would actually love more detail on is the agent side. How are you orchestrating agents to fan out the test runs?