CPU vs GPU (CUDA) segmentation results difference

Hello PyTorch Community,
I am running inference with a pre-trained segmentation model.
When I evaluate it with CUDA on my GPU cluster, I get different results than when I evaluate it on the CPU of my local Windows machine.
Here are some visual results. I found two possible reasons for the difference, but I am not sure the visual results should vary this much:

  • Floating-point precision
  • Differences in the execution order of operations
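
To illustrate the second point, here is a small plain-Python sketch (not taken from my model; the helper `sum_in_order` is just made up for this example) showing that adding the same numbers in a different order can already change the floating-point result:

```python
def sum_in_order(values):
    """Add the values one by one, strictly in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [0.1, 0.2, 0.3]

forward = sum_in_order(values)             # (0.1 + 0.2) + 0.3
backward = sum_in_order(reversed(values))  # (0.3 + 0.2) + 0.1

# Same inputs, different accumulation order, slightly different result.
print(forward, backward, forward == backward)
```

A parallel reduction on the GPU may accumulate in a different order than a CPU loop, so I would expect small differences of this kind in every layer.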

Any clue will be appreciated.
Thank You

This is potentially expected, especially if the model was trained on CPU and then moved to GPU. If the GPUs are Ampere (sm80) or newer, you might want to check if setting NVIDIA_TF32_OVERRIDE=0 changes the results somewhat (this env var turns off the use of the TF32 dtype internally).
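
If it is easier to test from inside the script, PyTorch also has its own TF32 switches. Below is a minimal sketch, assuming a reasonably recent PyTorch release; the helper name `disable_tf32` is made up for illustration. These flags act at the framework level rather than inside the NVIDIA libraries, so they are similar to the env var but not an exact substitute for it. As far as I know, cuDNN convolutions are allowed to use TF32 by default, so the second flag is probably the more relevant one for a convolutional segmentation model.

```python
import torch

def disable_tf32():
    """Ask PyTorch to use full FP32 instead of TF32 on Ampere or newer GPUs."""
    # Matrix multiplications dispatched to cuBLAS.
    torch.backends.cuda.matmul.allow_tf32 = False
    # Convolutions dispatched to cuDNN.
    torch.backends.cudnn.allow_tf32 = False

# Call once, before running inference.
disable_tf32()
print(torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32)
```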

If there is a serious degradation of accuracy and the above env var doesn’t meaningfully improve your results, I would check the layers used by your model one-by-one to see where the differences are coming from.
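
One way to do the layer-by-layer check is with forward hooks. This is a rough sketch, not a definitive recipe: `capture_outputs`, `compare_devices`, and the toy `nn.Sequential` model are all hypothetical stand-ins, and you would pass your own model and a real input batch instead. It records the output of every leaf module on both devices and reports the largest absolute difference per layer.

```python
import copy
import torch
import torch.nn as nn

def capture_outputs(model, x):
    """Run one forward pass and record each leaf module's output by name."""
    outputs, handles = {}, []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            def hook(_module, _inputs, out, name=name):
                if torch.is_tensor(out):
                    outputs[name] = out.detach().float().cpu()
            handles.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(x)
    for handle in handles:
        handle.remove()
    return outputs

def compare_devices(model, x, device_a="cpu", device_b="cuda"):
    """Return {layer name: max abs difference} between the two devices."""
    model = model.eval()
    out_a = capture_outputs(copy.deepcopy(model).to(device_a), x.to(device_a))
    out_b = capture_outputs(copy.deepcopy(model).to(device_b), x.to(device_b))
    return {name: (out_a[name] - out_b[name]).abs().max().item()
            for name in out_a if name in out_b}

# Toy model standing in for the real segmentation network (assumption).
toy = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 2, 1))
second = "cuda" if torch.cuda.is_available() else "cpu"
for name, diff in compare_devices(toy, torch.randn(1, 3, 16, 16), "cpu", second).items():
    print(f"layer {name}: max abs diff = {diff:.3e}")
```

The first layer where the difference jumps by orders of magnitude is usually the one worth looking at more closely.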

Hello, thanks for your reply.
The model was trained on GPU and then tested on both CPU and GPU.
The GPU is an A100.
Thanks for the suggestions, I will try them.