We're testing this out and would love more collaboration from the community - if your team is running large training jobs, hit us up.
tl;dr: we built TorchPass because large distributed training jobs fail a lot, and checkpoint restart is expensive. TorchPass addresses this by migrating the state from failed resources to spares.
In large GPU clusters, even small failures (a GPU falling off the bus, a node crash, a network link flap) can bring down an entire distributed training job, and once you get into clusters with hundreds or thousands of GPUs, something is almost always failing. Research from Meta suggests mean time to failure drops to about 7.9 hours for a 1,024-GPU cluster.
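As a back-of-envelope way to see why failure rates scale this badly, assume devices fail independently at a constant rate (our simplifying assumption here, not something claimed in the Meta study): the cluster's MTTF is then roughly the per-device MTTF divided by the device count.

```python
# Back-of-envelope: with n devices failing independently at a constant
# rate, the cluster-level MTTF is the per-device MTTF divided by n.
def cluster_mttf(per_device_mttf_hours: float, n_devices: int) -> float:
    return per_device_mttf_hours / n_devices

# Working backwards from the ~7.9 h figure for a 1,024-GPU cluster,
# the implied per-GPU MTTF is about 8,090 hours (~11 months) - i.e.
# individually reliable hardware still fails constantly at scale.
implied_per_gpu = 7.9 * 1024
print(cluster_mttf(implied_per_gpu, 1024))  # ~7.9 hours
```

The point of the toy formula: doubling the cluster roughly halves the time between job-killing failures, which is why recovery cost matters so much at this scale.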
The usual recovery model is to take frequent checkpoints during training, and recover from the most recent checkpoint when a failure occurs. But:
- all work since the last checkpoint is lost
- time is wasted replacing nodes and reloading the checkpoint
- more time is lost restarting the entire distributed job
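A minimal sketch of that recovery loop (pure-Python stand-ins for torch.save/torch.load; the function and its parameters are hypothetical, just to make the lost-work cost concrete):

```python
import copy

def train_with_checkpoint_restart(total_steps, ckpt_interval, fail_at=None):
    """Toy model of the usual recovery loop: checkpoint every
    `ckpt_interval` steps; on a failure, reload the last checkpoint
    and recompute everything since. Stand-in for torch.save/torch.load."""
    state = {"step": 0}
    checkpoint = copy.deepcopy(state)
    steps_executed = 0           # real work done, including recomputation
    pending_failure = fail_at is not None
    while state["step"] < total_steps:
        state["step"] += 1
        steps_executed += 1
        if pending_failure and state["step"] == fail_at:
            state = copy.deepcopy(checkpoint)  # work since last ckpt is lost
            pending_failure = False
        elif state["step"] % ckpt_interval == 0:
            checkpoint = copy.deepcopy(state)

    return steps_executed

# One failure at step 95 with checkpoints every 50 steps forces 45 steps
# of recomputation: 145 steps of work for 100 steps of progress.
print(train_with_checkpoint_restart(100, 50, fail_at=95))  # 145
```

On average a failure costs half a checkpoint interval of recomputation, plus the (unmodeled here) reload and job-restart time.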
TorchPass uses a different approach: instead of restarting the job, it migrates the failed training rank to a spare GPU and resumes training at the same step.
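To make the contrast concrete, here is a toy simulation of the migration idea (this is an illustration of the concept, not TorchPass internals - the class and function names are our own): a spare adopts the failed rank's identity, picks up equivalent state, and the job continues at the same step with no recomputation.

```python
import copy

class Rank:
    """Toy stand-in for one training rank's in-memory state."""
    def __init__(self, rank_id, step=0):
        self.rank_id = rank_id
        self.state = {"step": step}  # stand-in for model/optimizer state

def migrate(ranks, failed_id, spare, peer_id):
    """Illustrative rank migration (not TorchPass internals): the spare
    takes over the failed rank's id and copies state from a healthy
    data-parallel peer holding equivalent parameters, so training
    resumes at the same step instead of restarting from a checkpoint."""
    spare.rank_id = failed_id
    spare.state = copy.deepcopy(ranks[peer_id].state)
    ranks[failed_id] = spare
    return ranks

# Ranks 0 and 1 are data-parallel replicas at step 1200; rank 1 fails.
ranks = {0: Rank(0, step=1200), 1: Rank(1, step=1200)}
ranks = migrate(ranks, failed_id=1, spare=Rank(99), peer_id=0)
print(ranks[1].state["step"])  # 1200 - training continues, nothing recomputed
```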
TorchPass supports planned migration (triggered pre-emptively when an imminent failure is detected) and unplanned migration (triggered by a hard failure). Further details about how it works can be found here: https://clockwork.io/blog/torchpass-workload-fault-tolerance...
We ran a 3,000-step training benchmark using TorchTitan Llama-4 MoE Scout (109B) on 64 H200 GPUs with random failure injection, comparing checkpoint restarts, TorchPass, and TorchFT.
- TorchPass completed in 405 min
- Checkpoint restart completed in 818 min
- TorchFT completed in 930 min
Checkpoint restart was slower mainly because of the time spent restoring from checkpoint, restarting the training job, and recomputing the work lost since the last checkpoint.
TorchFT lost almost no time to the failures, but was slower overall because it introduces significant per-step overhead: it requires using gloo (rather than NCCL) for cross-replica all-reduce operations.
Happy to answer questions about the implementation and benchmarks.