1 point by k2so 5 hours ago | 2 comments
  • k2so 5 hours ago
    Author here. I wanted to train Nanochat d26 to GPT-2 level and had to pick between three H100 variants on Runpod.

    SXM was the most expensive per hour but the cheapest to finish:

    SXM: 702 ms/step - ~$37 (Vast.ai)
    PCIe: 1,412 ms/step - ~$112 (Runpod)
    NVL: 2,032 ms/step - ~$181 (Runpod)
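    The ranking flips because total cost is step time × total steps × hourly price, not hourly price alone. A rough sketch of the arithmetic (the step count and hourly prices below are placeholders, not my actual quotes):

      # Back-of-the-envelope: run cost scales with step time * hourly price.
      # STEPS and the hourly prices are illustrative placeholders only.
      def run_cost_usd(ms_per_step: float, total_steps: int, usd_per_hour: float) -> float:
          hours = ms_per_step * total_steps / 3_600_000  # ms -> hours
          return hours * usd_per_hour

      STEPS = 10_000  # placeholder, not the real d26 step count
      print(run_cost_usd(702, STEPS, 20.0))   # hypothetical SXM hourly price
      print(run_cost_usd(1412, STEPS, 15.0))  # hypothetical PCIe hourly price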

    My first SXM run hit 1,295 ms/step, barely faster than PCIe. Nsight Systems' OS runtime summary led me to suspect CPU starvation. A higher-vCPU instance on Vast.ai hit 700 ms, but a 128-vCPU SXM instance also hit ~700 ms, so vCPU count alone wasn't the explanation.
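    For reference, a minimal sketch of the Nsight Systems capture I mean (the training command is a placeholder, and the stats report name can differ by nsys version; recent releases call the OS runtime summary "osrt_sum"):

      import subprocess

      # Placeholder training command; substitute the real torchrun invocation.
      TRAIN_CMD = ["torchrun", "--nproc_per_node=8", "-m", "scripts.base_train"]

      # Trace CUDA, NVTX and OS runtime calls for a short window of the run.
      subprocess.run(
          ["nsys", "profile", "-o", "step_profile", "--trace=cuda,nvtx,osrt",
           "--duration=120"] + TRAIN_CMD,
          check=True,
      )

      # Summarize OS runtime calls (pthread_cond_*, sem_wait, ...); a large share
      # of wall time here is what pointed at CPU-side starvation.
      subprocess.run(
          ["nsys", "stats", "--report", "osrt_sum", "step_profile.nsys-rep"],
          check=True,
      )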

    Looking at the network topology on Runpod and Vast.ai, the first instance had its GPUs split 4+4 across two NUMA nodes. NCCL's bulk data transfers go over NVSwitch and are unaffected, but its control threads run on the CPU, and the cross-socket latency on every pthread_cond_signal added up.
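    A quick way to see that split without a full profile (a sketch, not from my repo): map each GPU's PCI bus id to the NUMA node reported by sysfs. On the slow instance this should show four GPUs on node 0 and four on node 1.

      import subprocess

      # Ask nvidia-smi for each GPU's PCI bus id, then read its NUMA node from sysfs.
      out = subprocess.run(
          ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"],
          capture_output=True, text=True, check=True,
      ).stdout

      for line in out.strip().splitlines():
          idx, bus_id = [f.strip() for f in line.split(",")]
          # nvidia-smi prints an 8-hex-digit domain ("00000000:1B:00.0");
          # sysfs wants 4 digits, lowercase ("0000:1b:00.0").
          domain, rest = bus_id.split(":", 1)
          sysfs_id = f"{int(domain, 16):04x}:{rest.lower()}"
          with open(f"/sys/bus/pci/devices/{sysfs_id}/numa_node") as f:
              print(f"GPU {idx}: NUMA node {f.read().strip()}")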

    NVL was the most confusing result: NCCL kernel times were nearly identical to PCIe, yet step times were 44% worse. Only 4 of the 28 GPU pairs share NVLink on NVL; the rest fall back to PCIe. I don't have a full explanation for this yet.
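    If anyone wants to check the pair count themselves, here's a rough NVML sketch (assumes nvidia-ml-py is installed; on an NVSwitch box the remote end of each link is a switch rather than a peer GPU, so nonzero pair counts mainly show up on NVL):

      import pynvml

      pynvml.nvmlInit()
      handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
                 for i in range(pynvml.nvmlDeviceGetCount())]

      def bus_id(info):
          b = info.busId
          return b.decode() if isinstance(b, bytes) else b

      gpu_bus_ids = [bus_id(pynvml.nvmlDeviceGetPciInfo(h)) for h in handles]

      pairs = {}  # (i, j) -> number of active NVLink links between GPU i and GPU j
      for i, h in enumerate(handles):
          for link in range(18):  # H100 exposes up to 18 NVLink links
              try:
                  if pynvml.nvmlDeviceGetNvLinkState(h, link) != pynvml.NVML_FEATURE_ENABLED:
                      continue
                  remote = bus_id(pynvml.nvmlDeviceGetNvLinkRemotePciInfo(h, link))
              except pynvml.NVMLError:
                  continue  # link index not populated on this GPU
              j = gpu_bus_ids.index(remote) if remote in gpu_bus_ids else None
              if j is not None and j > i:  # count each physical link once
                  pairs[(i, j)] = pairs.get((i, j), 0) + 1

      for (i, j), count in sorted(pairs.items()):
          print(f"GPU{i} <-> GPU{j}: {count} NVLink link(s)")
      pynvml.nvmlShutdown()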

    Profiling script: https://github.com/Nikhil-Kasukurthi/nanochat/blob/master/sc...

    Script with startup checks to ensure the instance is healthy: https://github.com/Nikhil-Kasukurthi/nanochat/blob/master/ru...

    Happy to discuss, especially if anyone has ideas on the NVL anomaly.

    • CamperBob2 5 hours ago
      I'd just get an RTX Pro 6000 Blackwell and call it a day. More VRAM. Somewhat less bandwidth but it's your bandwidth.
      • k2so 4 hours ago
        Yeah, for single-GPU inference, with the higher VRAM and FP4 support, the RTX 6000 should also fit larger models than the H100.