1 point by traceopt-ai 6 hours ago | 1 comment
  • traceopt-ai 6 hours ago
    This tool focuses on finding stragglers in multi-GPU PyTorch (DDP) training. In practice, one slow rank often gates the entire step, but it is hard to see which GPU is lagging and why.
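
    To make the idea concrete, here is a minimal sketch of one way to flag a lagging rank. It is not how the tool itself works. It assumes you already have per-rank compute times measured before the gradient all-reduce (end-to-end step times look nearly identical on every rank, because DDP synchronizes each step). The function name `find_stragglers`, the 1.2x threshold, and the sample timings are all hypothetical.

```python
from statistics import median


def find_stragglers(step_times, threshold=1.2):
    """Return the ranks whose mean compute time exceeds
    `threshold` times the median of the per-rank means.

    step_times: dict mapping rank -> list of per-step compute times (seconds).
    """
    # Average each rank's timings so a single noisy step does not dominate.
    means = {rank: sum(times) / len(times) for rank, times in step_times.items()}
    # The median is a robust baseline: one slow rank barely moves it.
    baseline = median(means.values())
    return sorted(rank for rank, mean in means.items() if mean > threshold * baseline)


# Hypothetical timings for 4 ranks over 3 steps; rank 2 is slow.
times = {
    0: [0.10, 0.11, 0.10],
    1: [0.10, 0.10, 0.11],
    2: [0.16, 0.17, 0.15],
    3: [0.11, 0.10, 0.10],
}
print(find_stragglers(times))  # prints [2]
```

    In a real DDP job the timings would have to be gathered from every rank onto one process first (for example with `torch.distributed.all_gather_object`), and GPU work would need to be timed with synchronization in mind. Both are left out of this sketch.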

    The tool is at an early stage and supports only single-node training for now. Feedback welcome.