SXM was the most expensive per hour but the cheapest to finish:

SXM: 702 ms/step, ~$37 total (Vast.ai)
PCIe: 1,412 ms/step, ~$112 total (RunPod)
NVL: 2,032 ms/step, ~$181 total (RunPod)
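The ranking follows directly from the arithmetic: total cost is step count times step time converted to hours, times the hourly rate, so the priciest hourly instance can still finish cheapest. A quick sketch (the step counts and hourly rates below are made up for illustration, not the actual figures from these runs):

```python
# Total training cost = hours needed * hourly rate.
# Step count and hourly rates are hypothetical illustrations only.
def total_cost(ms_per_step: float, steps: int, usd_per_hour: float) -> float:
    hours = steps * ms_per_step / 3_600_000  # ms -> hours
    return hours * usd_per_hour

# Hypothetical: 100k steps, SXM billed higher per hour than PCIe.
sxm = total_cost(702, 100_000, 16.0)    # faster, pricier per hour
pcie = total_cost(1412, 100_000, 12.0)  # slower, cheaper per hour
print(f"SXM: ${sxm:.0f}, PCIe: ${pcie:.0f}")
```

With a 2x step-time gap, the cheaper hourly rate never catches up.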
My first SXM run hit 1,295ms/step, barely faster than PCIe. Nsight Systems' OS runtime summary led me to suspect CPU starvation. A higher-vCPU instance on Vast.ai hit ~700ms, but a 128-vCPU SXM instance also hit ~700ms, so raw vCPU count wasn't the variable.
Comparing the topology on RunPod and Vast.ai, the slow instance had its GPUs split 4+4 across two NUMA nodes. NCCL's bulk data transfer goes over NVSwitch and is unaffected, but its control threads run on the CPU, and the cross-socket latency on every pthread_cond_signal added up.
NVL was the most confusing result: NCCL kernel times were nearly identical to PCIe, yet step times were 44% worse. On NVL, only 4 of the 28 GPU pairs share an NVLink; the rest fall back to PCIe. I don't have a full explanation for this yet.
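The 4-of-28 figure can be read off the same `nvidia-smi topo -m` matrix: with 8 GPUs there are C(8,2) = 28 pairs, and on an NVL box only the bridged pairs show an NV* link while the rest cross PCIe. A sketch on an illustrative matrix, assuming GPUs are bridged as (0,1), (2,3), (4,5), (6,7):

```python
# Count NVLink-connected GPU pairs on an NVL-style 8-GPU box.
# The link() function models an illustrative topology where only
# bridged pairs (0-1, 2-3, 4-5, 6-7) get an NVLink ("NV4");
# all other pairs traverse PCIe/SMP ("SYS").
from itertools import combinations

def link(i: int, j: int) -> str:
    return "NV4" if i // 2 == j // 2 else "SYS"

pairs = list(combinations(range(8), 2))
nv = [p for p in pairs if link(*p).startswith("NV")]
print(f"{len(nv)} of {len(pairs)} GPU pairs share NVLink")
# -> 4 of 28 GPU pairs share NVLink
```

So most of the all-reduce ring still crosses PCIe, which is consistent with the NCCL kernel times matching PCIe.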
Profiling script: https://github.com/Nikhil-Kasukurthi/nanochat/blob/master/sc...
Script with startup checks to ensure the instance is healthy: https://github.com/Nikhil-Kasukurthi/nanochat/blob/master/ru...
Happy to discuss, especially if anyone has ideas on the NVL anomaly.