We're testing this out and would love more collaboration from the community - if your team is running large training jobs, hit us up.
tl;dr: we built TorchPass because large distributed training jobs fail a lot, and checkpoint restart is expensive. TorchPass addresses this by migrating the state from failed resources to spares.
In large GPU clusters, even small failures (a GPU falling off the bus, a node crash, a network link flap) can bring down an entire distributed training job, and once you get into clusters with hundreds or thousands of GPUs, something is almost always failing. Research from Meta suggests mean time to failure drops to about 7.9 hours for a 1,024-GPU cluster.
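As a back-of-envelope way to see why failure rates scale this badly, assume devices fail independently at a constant rate (our simplifying assumption here, not something claimed in the Meta study): the cluster's MTTF is then roughly the per-device MTTF divided by the device count.

```python
# Back-of-envelope: with n devices failing independently at a constant
# rate, the cluster-level MTTF is the per-device MTTF divided by n.
def cluster_mttf(per_device_mttf_hours: float, n_devices: int) -> float:
    return per_device_mttf_hours / n_devices

# Working backwards from the ~7.9 h figure for a 1,024-GPU cluster,
# the implied per-GPU MTTF is about 8,090 hours (~11 months) - i.e.
# individually reliable hardware still fails constantly at scale.
implied_per_gpu = 7.9 * 1024
print(cluster_mttf(implied_per_gpu, 1024))  # ~7.9 hours
```

The point of the toy formula: doubling the cluster roughly halves the time between job-killing failures, which is why recovery cost matters so much at this scale.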
The usual recovery model is to take frequent checkpoints during training, and recover from the most recent checkpoint when a failure occurs. But:
- all work since the last checkpoint is lost
- time is wasted replacing nodes and reloading the checkpoint
- more time is lost restarting the entire distributed job
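A minimal sketch of that recovery loop (pure-Python stand-ins for torch.save/torch.load; the function and its parameters are hypothetical, just to make the lost-work cost concrete):

```python
import copy

def train_with_checkpoint_restart(total_steps, ckpt_interval, fail_at=None):
    """Toy model of the usual recovery loop: checkpoint every
    `ckpt_interval` steps; on a failure, reload the last checkpoint
    and recompute everything since. Stand-in for torch.save/torch.load."""
    state = {"step": 0}
    checkpoint = copy.deepcopy(state)
    steps_executed = 0           # real work done, including recomputation
    pending_failure = fail_at is not None
    while state["step"] < total_steps:
        state["step"] += 1
        steps_executed += 1
        if pending_failure and state["step"] == fail_at:
            state = copy.deepcopy(checkpoint)  # work since last ckpt is lost
            pending_failure = False
        elif state["step"] % ckpt_interval == 0:
            checkpoint = copy.deepcopy(state)

    return steps_executed

# One failure at step 95 with checkpoints every 50 steps forces 45 steps
# of recomputation: 145 steps of work for 100 steps of progress.
print(train_with_checkpoint_restart(100, 50, fail_at=95))  # 145
```

On average a failure costs half a checkpoint interval of recomputation, plus the (unmodeled here) reload and job-restart time.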
TorchPass uses a different approach: instead of restarting the job, it migrates the failed training rank to a spare GPU and resumes training at the same step.
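To make the contrast concrete, here is a toy simulation of the migration idea (this is an illustration of the concept, not TorchPass internals - the class and function names are our own): a spare adopts the failed rank's identity, picks up equivalent state, and the job continues at the same step with no recomputation.

```python
import copy

class Rank:
    """Toy stand-in for one training rank's in-memory state."""
    def __init__(self, rank_id, step=0):
        self.rank_id = rank_id
        self.state = {"step": step}  # stand-in for model/optimizer state

def migrate(ranks, failed_id, spare, peer_id):
    """Illustrative rank migration (not TorchPass internals): the spare
    takes over the failed rank's id and copies state from a healthy
    data-parallel peer holding equivalent parameters, so training
    resumes at the same step instead of restarting from a checkpoint."""
    spare.rank_id = failed_id
    spare.state = copy.deepcopy(ranks[peer_id].state)
    ranks[failed_id] = spare
    return ranks

# Ranks 0 and 1 are data-parallel replicas at step 1200; rank 1 fails.
ranks = {0: Rank(0, step=1200), 1: Rank(1, step=1200)}
ranks = migrate(ranks, failed_id=1, spare=Rank(99), peer_id=0)
print(ranks[1].state["step"])  # 1200 - training continues, nothing recomputed
```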
TorchPass supports planned migration (triggered pre-emptively when an imminent failure is detected) and unplanned migration (triggered by a hard failure). Further details about how it works can be found here: https://clockwork.io/blog/torchpass-workload-fault-tolerance...
We ran a 3,000-step training benchmark using TorchTitan Llama-4 MoE Scout (109B) on 64 H200 GPUs with random failure injection, comparing checkpoint restarts, TorchPass, and TorchFT.
- TorchPass completed in 405 min
- Checkpoint restart completed in 818 min
- TorchFT completed in 930 min
Checkpoint restart was slower mainly because of the time spent restoring from checkpoint, restarting the training job, and recomputing the work lost since the last checkpoint.
TorchFT lost almost no time to the failures, but was slower overall because it introduces significant per-step overhead: it requires using gloo (rather than NCCL) for cross-replica all-reduce operations.
Happy to answer questions about the implementation and benchmarks.