Show HN: Finding stragglers in multi-GPU PyTorch (DDP) training (github.com)
1 point by traceopt-ai 6 hours ago | 1 comment
traceopt-ai 6 hours ago
This tool focuses on finding stragglers in multi-GPU PyTorch (DDP) training. In practice, one slow rank often gates the entire step, but it is hard to see which GPU is lagging and why.
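To make the "one slow rank gates the step" point concrete, here is a minimal sketch in plain Python. It is not how the tool works internally. The function names, the 1.2 tolerance, and the timings are all made up for illustration, and it assumes you already have per-rank compute durations measured before gradient synchronization.

```python
from statistics import median

def find_stragglers(step_times, tolerance=1.2):
    """Return ranks whose median step duration exceeds the cross-rank median
    by more than `tolerance` (a hypothetical threshold, not a tuned value).

    step_times maps rank -> list of per-step compute durations in seconds.
    """
    per_rank = {rank: median(times) for rank, times in step_times.items()}
    baseline = median(per_rank.values())
    return sorted(rank for rank, t in per_rank.items() if t > tolerance * baseline)

def effective_step_time(step_times):
    """In synchronous data parallel training, a step ends only when the
    slowest rank is done, so the effective time is the per-step maximum."""
    return [max(step) for step in zip(*step_times.values())]

# Made-up timings for 4 ranks over 3 steps; rank 2 is artificially slow.
timings = {
    0: [0.10, 0.11, 0.10],
    1: [0.10, 0.10, 0.11],
    2: [0.18, 0.17, 0.19],
    3: [0.11, 0.10, 0.10],
}

stragglers = find_stragglers(timings)
gated = effective_step_time(timings)
print("stragglers:", stragglers)
print("effective step times:", gated)
```

With these numbers every step costs what rank 2 costs, even though the other three ranks finish in roughly 0.10 s. Comparing medians rather than means keeps a single noisy step from flagging a healthy rank.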
This is early and single-node only for now. Feedback welcome.