Pretty much anything, tbh. As long as you have a runbook, you can try to automate its actions through nvsx, which sits on top of NVSentinel.
Restarting mostly works for smaller jobs. Distributed training, which is pretty common, needs more fault-tolerant methods to keep going rather than just restarting.
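
To make the "more than just restarting" point concrete, here's a minimal sketch of checkpoint-and-resume, one common fault-tolerance pattern. It's a toy in plain Python and has nothing to do with the nvsx or NVSentinel APIs. The names (`save_checkpoint`, `load_checkpoint`, `train`, `fail_at`) are made up for illustration, and the running counter stands in for real model and optimizer state.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write never leaves a corrupt file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def train(path, num_steps, checkpoint_every=10, fail_at=None):
    """Toy training loop; 'total' is a stand-in for model/optimizer state."""
    state = load_checkpoint(path)
    while state["step"] < num_steps:
        if fail_at is not None and state["step"] == fail_at:
            # Hypothetical fault injection, only here to demo recovery.
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]  # stand-in for one training step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the job dies partway through, calling `train` again with the same path picks up from the last saved step instead of step 0, so you only lose the work done since the last checkpoint. Real distributed jobs also have to coordinate this across ranks, which this sketch doesn't attempt.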