We've open-sourced a tool (https://github.com/DebarghaG/estimate-train-time) that estimates wall-clock time for LLM training across multi-GPU setups with 3D parallelism (pipeline, tensor, and data).
This problem is extremely hard: you're modeling the interplay of thousands of GPU kernels, NCCL collectives across heterogeneous network topologies, pipeline bubbles, activation recomputation, and ZeRO optimizer communication, all of which interact in non-obvious ways at scale. Even estimates that are off by 2x are useless for capacity planning.
This took two years of painstaking work and ~$100k worth of cluster time, and is validated on real workloads on Perlmutter (NERSC) and Vista (TACC), two of the largest HPC clusters available for open science.
How it works:

1. Kernel-level profiling: We sample execution times for kernels like Flash Attention, fused GEMM (QKV/FFN projections), RMSNorm, embedding lookups, and cross-entropy loss across the (batch, seq_len, hidden_dim, num_heads, MP degree) parameter space.

2. Communication modeling: NCCL benchmarks capture ring all-reduce (tensor/data-parallel sync), all-gather (ZeRO-1 parameter collection), and P2P send/recv (pipeline-stage activation transfers) across intra-node NVLink and inter-node InfiniBand topologies.

3. Analytical composition: Operator predictions feed into a pipeline scheduling model (AF-AB / 1F1B) that accounts for bubble overhead, a (PP - 1) / (num_microbatches + PP - 1) idle fraction, as well as layer distribution across head/middle/tail stages and overlapped DP gradient sync (see the sketch after this list).

4. Runs on CPU (post-sampling): no GPU access is needed to predict training time.
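To make the composition step concrete, here's a minimal sketch in plain Python (names are illustrative, not the tool's actual API) of rolling per-stage operator predictions up into an iteration-time estimate under 1F1B scheduling with the bubble fraction above:

    from dataclasses import dataclass

    @dataclass
    class StageTimes:
        fwd_ms: float   # forward pass for one microbatch on this pipeline stage
        bwd_ms: float   # backward pass for one microbatch on this stage
        p2p_ms: float   # activation send/recv to the neighboring stage

    def iteration_time_ms(stages, num_microbatches, dp_allreduce_ms, overlap_dp=True):
        """One training iteration under 1F1B: the slowest stage sets the
        steady-state rate, and the pipeline idles for a
        (PP - 1) / (num_microbatches + PP - 1) fraction of the time."""
        pp = len(stages)
        per_microbatch = max(s.fwd_ms + s.bwd_ms + s.p2p_ms for s in stages)
        # Equivalent to steady_time / (1 - bubble_fraction):
        pipeline_ms = (num_microbatches + pp - 1) * per_microbatch
        # Assumption: the DP gradient all-reduce hides behind backward when
        # overlapped, otherwise it lands on the critical path.
        return pipeline_ms if overlap_dp else pipeline_ms + dp_allreduce_ms

    # Example: PP=4 with identical stages, 32 microbatches, 40 ms gradient sync.
    stages = [StageTimes(fwd_ms=12.0, bwd_ms=24.0, p2p_ms=1.5)] * 4
    print(iteration_time_ms(stages, num_microbatches=32, dp_allreduce_ms=40.0))

The actual model also handles uneven layer counts on head/middle/tail stages; the sketch only shows how the bubble term enters the estimate.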
The tool is designed as an extensible recipe: you can profile your own hardware with the bundled kernel-sampling and NCCL-benchmarking scripts, and add custom operators by implementing the regressor interface (a hypothetical sketch follows).
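For illustration only, assuming the regressor interface amounts to "fit sampled kernel times, predict on new configs", a custom operator could look something like this (the class name, feature set, and method signatures here are assumptions; the repo's actual interface may differ):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    class MyFusedKernelRegressor:
        # Hypothetical operator regressor: trained on sampled kernel timings
        # over the same parameter space the profiling scripts sweep.
        FEATURES = ("batch", "seq_len", "hidden_dim", "num_heads", "mp_degree")

        def __init__(self):
            self._model = RandomForestRegressor(n_estimators=200)

        def fit(self, samples, times_ms):
            # samples: list of dicts keyed by FEATURES; times_ms: measured times
            X = np.array([[s[f] for f in self.FEATURES] for s in samples])
            self._model.fit(X, np.asarray(times_ms))

        def predict_ms(self, config):
            x = np.array([[config[f] for f in self.FEATURES]])
            return float(self._model.predict(x)[0])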
This work builds on our HiPC 2025 paper on fine-grained GPU performance modeling. Earlier code to reproduce the paper's results: https://github.com/ICICLE-ai/distributed_training_estimator_...
Looking for early adopters and feedback, especially from teams doing parallelism-strategy search or capacity planning at scale.