Hi,
I trained a model and tested it on two different GPU machines: (i) an NVIDIA GeForce RTX 3090 and (ii) a GeForce RTX 2080 Ti. However, I got different accuracy results on the two machines. Note that I use exactly the same code base and the same trained model, and only evaluation is performed here. After some checking, I found that each Conv2d operation gives slightly different results on the two machines for the same input data. Here is an example of the Conv2d output:
On machine (i):
tensor([-0.0041, -0.0202, -0.0055, -0.0015, -0.0053, 0.0510, -0.0171, -0.0144,
0.0384, 0.0162], device='cuda:0', grad_fn=<SliceBackward>)
On machine (ii):
tensor([-0.0049, -0.0200, -0.0050, -0.0019, -0.0054, 0.0477, -0.0157, -0.0146,
0.0360, 0.0177], device='cuda:0', grad_fn=<SliceBackward>)
I expected the numbers to be the same on both machines. Any idea how to solve this issue?
Additional info:
- Both machines use the same Torch version (1.9.0)
- Machine (i) uses CUDA 11.3, machine (ii) uses CUDA 10.2
- I printed the Conv2d weights, and both machines give the same numbers
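To make the comparison easier to reproduce, here is a minimal sketch of a check that could be run on both machines. The layer shape, input size, seed, and the helper names `describe_environment` and `conv_max_abs_error` are hypothetical and not taken from my model. It compares a Conv2d output against a float64 CPU reference and prints the TF32 flags, on the assumption that TF32 might be one source of the gap (the RTX 3090 is an Ampere GPU that supports TF32, the RTX 2080 Ti is not). I have not confirmed that this is the cause.

```python
import copy

import torch
import torch.nn as nn


def describe_environment():
    """Collect version and precision settings that can differ between machines."""
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,                # None on CPU-only builds
        "cudnn": torch.backends.cudnn.version(),   # None if cuDNN is unavailable
        "cudnn_allow_tf32": torch.backends.cudnn.allow_tf32,
        "matmul_allow_tf32": torch.backends.cuda.matmul.allow_tf32,
    }


def conv_max_abs_error(device, seed=0):
    """Max absolute difference between a float32 Conv2d output on `device`
    and a float64 reference computed on the CPU with the same weights."""
    torch.manual_seed(seed)
    # Hypothetical layer and input; replace with a layer from the real model.
    conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
    x = torch.randn(1, 3, 32, 32)

    ref_conv = copy.deepcopy(conv).double()
    with torch.no_grad():
        ref = ref_conv(x.double())
        out = conv.to(device)(x.to(device)).double().cpu()
    return (out - ref).abs().max().item()


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(describe_environment())
    print("max abs error (default settings):", conv_max_abs_error(device))

    # Assumption to test: turn TF32 off and see whether the error shrinks.
    torch.backends.cudnn.allow_tf32 = False
    torch.backends.cuda.matmul.allow_tf32 = False
    print("max abs error (TF32 disabled):", conv_max_abs_error(device))
```

If the error on machine (i) drops to the level seen on machine (ii) once TF32 is disabled, that would point to TF32 rather than the CUDA version. Even then, I understand that bit-identical results across different GPUs are not guaranteed.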
Thanks!