Got Different Inference Conv2d Results on Different GPU Machines

Hi,

I trained a model and tested it on two different GPU machines: (i) an NVIDIA GeForce RTX 3090 and (ii) a GeForce RTX 2080 Ti. However, I get different accuracy results on the two machines. Note that I use exactly the same code base and the same trained model, and only evaluation is performed here. After some checking, I found that each Conv2d operation gives slightly different results for the same input data on the two machines. Here is an example of the Conv2d output:
On machine (i):

tensor([-0.0041, -0.0202, -0.0055, -0.0015, -0.0053,  0.0510, -0.0171, -0.0144,
         0.0384,  0.0162], device='cuda:0', grad_fn=<SliceBackward>)

On machine (ii):

tensor([-0.0049, -0.0200, -0.0050, -0.0019, -0.0054,  0.0477, -0.0157, -0.0146,
         0.0360,  0.0177], device='cuda:0', grad_fn=<SliceBackward>)

I expected the numbers to be the same on both machines. Any idea how to solve this issue?

Additional info:

  • Both machines use the same torch version (1.9.0)
  • Machine (i) uses CUDA 11.3, machine (ii) uses CUDA 10.2
  • I printed the Conv2d weights, and both machines give the same numbers (a minimal sketch of this check is shown below)
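
For reference, here is a minimal sketch of the kind of check I ran. The file name model.pth, the layer name conv1, and the input shape are placeholders for illustration:

import torch

# Load the trained model and switch to evaluation mode
# ('model.pth' and the layer name 'conv1' are placeholders)
model = torch.load('model.pth', map_location='cuda:0')
model.eval()

# Create the input on the CPU with a fixed seed so both machines
# feed exactly the same data to the layer, then move it to the GPU
torch.manual_seed(0)
x = torch.randn(1, 3, 224, 224).to('cuda:0')

out = model.conv1(x)

# Compare a small slice of the output and of the weights across machines
print(out.flatten()[:10])
print(model.conv1.weight.flatten()[:10])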

Thanks!

Hi Ardian!

Could this be TF32? Note that your 3090 GPU supports TF32, while your
2080 Ti does not, so this is probably the issue.
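
If TF32 is indeed the cause, you should be able to confirm it by turning
TF32 off on the 3090 before running your evaluation. A minimal sketch
(these flags exist in pytorch 1.7 and later and default to True on
Ampere GPUs such as the 3090):

import torch

# Disable TF32 for cuDNN convolutions and cuBLAS matmuls so that the
# 3090 computes convolutions in full FP32, like the 2080 Ti does
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False

# ... then run your evaluation as usual ...

If the two machines then agree (up to ordinary floating-point noise),
the discrepancy was TF32's reduced internal precision.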

(@ptrblck – Could you guys consider turning off TF32 by default? TF32
is a nice feature, but recent evidence suggests that enabling it by default
counts as a bug.)

Please see this recent post:

Best.

K. Frank

Could you add your concerns to this RFE so that we could track it, please?