SXM was the most expensive per hour but the cheapest to finish:

SXM: 702 ms/step, ~$37 total (Vast.ai)
PCIe: 1,412 ms/step, ~$112 total (RunPod)
NVL: 2,032 ms/step, ~$181 total (RunPod)
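The ranking follows directly from the arithmetic: total cost is step count times step time converted to hours, times the hourly rate, so the priciest hourly instance can still finish cheapest. A quick sketch (the step counts and hourly rates below are made up for illustration, not the actual figures from these runs):

```python
# Total training cost = hours needed * hourly rate.
# Step count and hourly rates are hypothetical illustrations only.
def total_cost(ms_per_step: float, steps: int, usd_per_hour: float) -> float:
    hours = steps * ms_per_step / 3_600_000  # ms -> hours
    return hours * usd_per_hour

# Hypothetical: 100k steps, SXM billed higher per hour than PCIe.
sxm = total_cost(702, 100_000, 16.0)    # faster, pricier per hour
pcie = total_cost(1412, 100_000, 12.0)  # slower, cheaper per hour
print(f"SXM: ${sxm:.0f}, PCIe: ${pcie:.0f}")
```

With a 2x step-time gap, the cheaper hourly rate never catches up.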
My first SXM run hit 1,295ms/step, barely faster than PCIe. Nsight Systems' OS runtime summary led me to suspect CPU starvation. A higher-vCPU instance on Vast.ai hit ~700ms, but a 128-vCPU SXM instance also hit ~700ms, so raw vCPU count wasn't the variable.
Comparing the topology on RunPod and Vast.ai, the slow instance had its GPUs split 4+4 across two NUMA nodes. NCCL's bulk data transfer goes over NVSwitch and is unaffected, but its control threads run on the CPU, and the cross-socket latency on every pthread_cond_signal added up.
NVL was the most confusing result: NCCL kernel times were nearly identical to PCIe, yet step times were 44% worse. On NVL, only 4 of the 28 GPU pairs share an NVLink; the rest fall back to PCIe. I don't have a full explanation for this yet.
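The 4-of-28 figure can be read off the same `nvidia-smi topo -m` matrix: with 8 GPUs there are C(8,2) = 28 pairs, and on an NVL box only the bridged pairs show an NV* link while the rest cross PCIe. A sketch on an illustrative matrix, assuming GPUs are bridged as (0,1), (2,3), (4,5), (6,7):

```python
# Count NVLink-connected GPU pairs on an NVL-style 8-GPU box.
# The link() function models an illustrative topology where only
# bridged pairs (0-1, 2-3, 4-5, 6-7) get an NVLink ("NV4");
# all other pairs traverse PCIe/SMP ("SYS").
from itertools import combinations

def link(i: int, j: int) -> str:
    return "NV4" if i // 2 == j // 2 else "SYS"

pairs = list(combinations(range(8), 2))
nv = [p for p in pairs if link(*p).startswith("NV")]
print(f"{len(nv)} of {len(pairs)} GPU pairs share NVLink")
# -> 4 of 28 GPU pairs share NVLink
```

So most of the all-reduce ring still crosses PCIe, which is consistent with the NCCL kernel times matching PCIe.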
Profiling script: https://github.com/Nikhil-Kasukurthi/nanochat/blob/master/sc...
Script with startup checks to ensure the instance is healthy: https://github.com/Nikhil-Kasukurthi/nanochat/blob/master/ru...
Happy to discuss, especially if anyone has ideas on the NVL anomaly.