Stick to the math and make the case that the hardware behaviour actually drifts outside its documented bounds once you've accounted for the non-determinism already in the system: CUDA thread atomics, for example, or batch sizes and layouts if you're layering concurrency on top.
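To make that baseline concrete, here's a minimal sketch (my own illustration, not something from your writeup) of the run-to-run variation you already get from float atomics alone: atomicAdd accumulates in whatever order the scheduler happens to pick, float addition isn't associative, so the low bits of a reduction typically wobble between runs with zero hardware misbehaviour involved. Any out-of-spec claim needs to clear that noise floor first.

```cuda
// Sketch: run-to-run variation from float atomicAdd alone (no hardware fault needed).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void atomic_sum(const float *x, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Accumulation order depends on warp/block scheduling, which varies between runs,
        // and float addition is not associative, so rounding differs with the order.
        atomicAdd(out, x[i]);
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f / (i + 1);  // values of varying magnitude

    for (int run = 0; run < 3; ++run) {
        *out = 0.0f;
        atomic_sum<<<(n + 255) / 256, 256>>>(x, out, n);
        cudaDeviceSynchronize();
        // The low bits of the sum typically differ across runs on the same GPU.
        printf("run %d: %.10f\n", run, *out);
    }
    cudaFree(x);
    cudaFree(out);
    return 0;
}
```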
You might have some good research here, but it's buried under LLM slop. This style of writeup is not likely to grab Nvidia's attention, I think.