Is it possible for PyTorch > 1.4 built with CUDA 9.0 to show anomalous behavior?

Hi!
I built PyTorch 1.5 with CUDA 9.0 myself. Recently I encountered NaN/inf values during the network's forward pass, but the same code works fine on other servers (PyTorch 1.5 + CUDA 10.2).
Is it possible for PyTorch > 1.4 built with CUDA 9.0 to show anomalous behavior?
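For context, this is roughly how I look for where the NaN first shows up in the forward pass; just a minimal sketch with placeholder names, not my actual debugging code:

```python
import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module):
    """Raise on the first module whose forward output contains NaN/inf."""
    def check(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            raise RuntimeError(f"non-finite output in {module.__class__.__name__}")
    for module in model.modules():
        module.register_forward_hook(check)
```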

  • The random NaN was introduced when 1 of the 64 weights in the first conv layer suddenly became NaN.
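A rough sketch of the kind of check that caught this (the `model` name is a placeholder, not my real training loop): scan the parameters after each optimizer step and report any that went non-finite.

```python
import torch

def find_bad_params(model):
    # Return the names of parameters that contain NaN/inf values.
    return [name for name, p in model.named_parameters()
            if not torch.isfinite(p).all()]

# In the training loop, right after optimizer.step():
# bad = find_bad_params(model)
# if bad:
#     print("NaN/inf weights in:", bad)
```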

It seems one of my GPUs is broken…

How did you figure it out?

I ran the same code/data on different GPUs by setting CUDA_VISIBLE_DEVICES, and only gpu:0 gave me NaN after a random number of epochs.
I haven't seen it again since a server reboot. :sweat_smile:
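Roughly, the isolation step looked like the sketch below (the script name `train.py` is a placeholder, not my actual command line): launch the same run once per card with a single device pinned via CUDA_VISIBLE_DEVICES.

```python
import os
import subprocess
import torch

# One run per physical GPU; inside each run the pinned card shows up as cuda:0.
for idx in range(torch.cuda.device_count()):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(idx))
    subprocess.run(["python", "train.py"], env=env, check=True)
```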