3 points by birdculture a day ago | 1 comment
  • thundergolfer a day ago
    Hey, author here :)

    I wrote this as a first step in exposing our internal GPU reliability management, inspired in large part by SemiAnalysis' focus on industry best practices in the ClusterMax report and Lepton's `gpud`.

    NVIDIA GPUs have 172 "Xid" error codes, and NVIDIA adds more with each new major driver release. Coming out of 2025, we have a good handle on which Xids (and SXids) are critical and can be handled automatically.

    The interesting next frontier is the set of Xids that sit ambiguously between hardware issues and application issues. Xid 31, GPU memory page fault, is the most dreaded code: ~95% of the time it's an application exception, but it's pretty tricky for users to debug and confirm.
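
To make the critical-versus-ambiguous split concrete, here is a minimal sketch of Xid triage from kernel log lines. It is not our actual implementation. The log line shape follows what the NVIDIA driver typically writes to the kernel log, and the policy sets (`CRITICAL_XIDS`, `AMBIGUOUS_XIDS`) and action names (`drain`, `investigate`, `log`) are hypothetical placeholders chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The NVIDIA driver typically logs Xids to the kernel log in roughly this shape:
#   NVRM: Xid (PCI:0000:3b:00): 31, pid=1234, name=python, Ch 00000008, ...
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),?\s*(?P<detail>.*)"
)

# Hypothetical policy for illustration only, not a vetted classification.
CRITICAL_XIDS = {48, 79}  # 48: double-bit ECC error, 79: GPU has fallen off the bus
AMBIGUOUS_XIDS = {31}     # 31: GPU memory page fault, usually an application bug


@dataclass(frozen=True)
class XidEvent:
    pci: str     # PCI address of the GPU that reported the error
    xid: int     # numeric Xid code
    detail: str  # remainder of the log line (pid, process name, etc.)


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return an XidEvent if the line contains an Xid report, else None."""
    m = XID_LINE.search(line)
    if m is None:
        return None
    return XidEvent(pci=m["pci"], xid=int(m["xid"]), detail=m["detail"].strip())


def triage(event: XidEvent) -> str:
    """Map an Xid event to a hypothetical action name."""
    if event.xid in CRITICAL_XIDS:
        return "drain"        # take the GPU out of rotation automatically
    if event.xid in AMBIGUOUS_XIDS:
        return "investigate"  # could be hardware or application; needs more signal
    return "log"              # record it and move on
```

Feeding each line of `dmesg` output through `parse_xid_line` and then `triage` gives a per-event action. The hard part for Xid 31 is everything hidden behind `investigate`: deciding whether a given page fault is the user's bug or a failing GPU.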

    Automating Xid 31 handling is my new GPU reliability holy grail.