1 point by essekar 7 hours ago | 2 comments
  • nigardev 7 hours ago
    curious what specific fault modes this handles. is it mainly for ECC errors, or something else like timeout recovery? also wondering how this compares to just restarting the affected process, which has worked for us on workstation GPUs.
    • essekar 6 hours ago
      anything tbh. as long as you have a runbook, you can try to automate its actions through nvsx; it sits on top of NVSentinel. restarting mostly works for smaller jobs. distributed training, which is pretty common, needs more fault-tolerant methods to keep going rather than just restarting.
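To make the runbook idea above concrete, here is a minimal Python sketch of mapping fault types to an ordered list of remediation steps, with escalation when nothing works. It does not use the nvsx or NVSentinel APIs. Every name in it (`Fault`, `RUNBOOK`, `restart_process`, `drain_node`, `run_runbook`) and the fault labels are hypothetical, and the actions only simulate outcomes rather than touching real hardware.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Fault:
    kind: str  # hypothetical labels for this sketch, e.g. "ecc_error", "timeout"
    node: str


# An action returns True when it considers the fault resolved.
Action = Callable[[Fault], bool]


def restart_process(fault: Fault) -> bool:
    # Simulated outcome: assume a restart only clears timeouts.
    return fault.kind == "timeout"


def drain_node(fault: Fault) -> bool:
    # Simulated outcome: assume draining the node always clears the fault.
    return True


# Hypothetical runbook: fault kind -> remediation steps, tried in order.
RUNBOOK: Dict[str, List[Action]] = {
    "timeout": [restart_process],
    "ecc_error": [restart_process, drain_node],
}


def run_runbook(fault: Fault, runbook: Dict[str, List[Action]], log: List[str]) -> bool:
    """Try each step for this fault kind; escalate if none resolves it."""
    for action in runbook.get(fault.kind, []):
        log.append(f"{action.__name__}:{fault.node}")
        if action(fault):
            return True
    log.append(f"escalate:{fault.node}")
    return False
```

In this sketch a plain restart is just the first step in the list, so the small-job case and the harsher remediation path share one code path. A fault kind with no runbook entry goes straight to escalation.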