Different inference results across CUDA computing architectures

Hello!

I am experiencing different inference results across different compute architectures, e.g. a 2080 Ti and a 3060, which have compute capabilities 7.5 and 8.6, respectively. I am using torch 1.13.1+cu116. The model also uses some external code, such as detectron2.

What could be the reason for this? Can the underlying math operations differ between compute architectures, and if so, can I minimize the differences? Can Docker cause discrepancies and/or non-determinism between different builds? Can the same built Docker image, used on different machines, cause non-determinism?

Could the steps described here resolve this behaviour?

In both cases, torch.__config__.show() gives an identical printout, and the NVIDIA drivers are also the same.
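For context, this is roughly what I am checking/setting on both machines (a minimal sketch based on my understanding of the reproducibility-related settings; the seed value is arbitrary):

import torch

# Enable deterministic behaviour where possible (warn_only avoids hard errors
# for ops without deterministic implementations) and record the relevant
# library/device state so the two machines can be compared.
torch.manual_seed(0)
torch.use_deterministic_algorithms(True, warn_only=True)
torch.backends.cudnn.benchmark = False

print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())
print(torch.cuda.get_device_name(), torch.cuda.get_device_capability())
print(torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32)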

Thanks!

You might want to check whether TF32 is enabled and, if so, disable it on your 3090 via:

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

at the cost of a potential slowdown.
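To see how much TF32 alone can change results, here is a minimal sketch (independent of your model) comparing a float32 matmul with the flag on and off; on pre-Ampere GPUs the difference is zero because TF32 is not used there:

import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

torch.backends.cuda.matmul.allow_tf32 = True
out_tf32 = a @ b

torch.backends.cuda.matmul.allow_tf32 = False
out_fp32 = a @ b

# Non-zero on Ampere (compute capability >= 8.0), zero on e.g. a 2080 Ti.
print((out_tf32 - out_fp32).abs().max())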

Thank you for the quick reply @ptrblck!

The model shows expected behaviour with 30xx GPUs, but not with the 2080. In that case, is that tip still valid?

Thanks!

Your 2080 does not support TF32 so you won’t be able to enable/disable it.
However, could you explain what “expected” behavior means in this context?
Do you see the same results on both devices if you disable TF32 on the 3090 and if so, are these results “worse” in your metric?

Thank you for the tips. From visualizations of a couple of samples, the model on the 2080 seemed to produce far fewer predictions, but sometimes many overlapping ones, whereas on the 3060 the model was accurate and produced more sensible results. I’ll do some more quantitative comparisons of the predictions against annotated data to see how much they differ in practice, and try some remedies accordingly (see the sketch below).
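As a first step, I am planning something like this minimal sketch (model, samples and file names are placeholders) to dump the raw predictions on each machine and compare them offline:

import torch

# Run the model on a fixed set of samples and save the raw outputs so they
# can be compared across GPUs/machines afterwards.
@torch.no_grad()
def dump_predictions(model, samples, path):
    model.eval()
    outputs = [model(sample) for sample in samples]
    torch.save(outputs, path)

# On each machine, e.g.:
# dump_predictions(model, samples, "preds_3090.pt")
# dump_predictions(model, samples, "preds_2080ti.pt")
# Then load both files on one machine and compare e.g. the number of
# detections per image and the score distributions.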

Please keep me updated as it could also be entirely unrelated to the TF32 numerical format and could indicate a bug visible on the 2080.

Hello @ptrblck,

Update: I have been experimenting with an RTX 3090 and an RTX 2080 Ti. The predictions are quantitatively consistent on each GPU individually, but they differ a lot between the two GPUs. The RTX 3090 demonstrates the expected and desired behaviour, but on the RTX 2080 Ti there are usually dozens more predictions, in some cases even hundreds more. I will continue to investigate. We are seeing that GPUs with compute capability >= 8.0 work well.

The model was trained on RTX 3090.

Also, if I configure the following on the 3090:

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

there are only minor differences in the prediction results (before, these flags were False and True, respectively).

Hello @ptrblck,

When the custom CUDA/C++ extension is built, the code is generated targeting only the architecture of the build machine (e.g. -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 when building on the 2080 Ti). So the extension built on the 3090 did not work as expected on the 2080 Ti. After rebuilding on the 2080 Ti, everything works.
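For reference, a minimal sketch (with hypothetical file and module names) of one way to build such an extension for both architectures in a single build, via setuptools and torch.utils.cpp_extension:

# setup.py -- minimal sketch; names and source files are placeholders.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="my_custom_ext",
    ext_modules=[
        CUDAExtension(
            name="my_custom_ext",
            sources=["my_custom_ext.cpp", "my_custom_ext_kernel.cu"],
            extra_compile_args={
                "cxx": [],
                "nvcc": [
                    # Emit device code for both Turing (7.5) and Ampere (8.6).
                    "-gencode=arch=compute_75,code=sm_75",
                    "-gencode=arch=compute_86,code=sm_86",
                ],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)

Alternatively, many extension builds (detectron2 included, as far as I know) honour the TORCH_CUDA_ARCH_LIST environment variable, e.g. TORCH_CUDA_ARCH_LIST="7.5;8.6" python setup.py install, so one image can cover both GPUs.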

Thank you for your time and suggestions.

Oh, that’s interesting as I didn’t realize you are using custom CUDA code.
I would also expect to see kernel launch errors if the binary isn’t compatible, but it seems your 2080Ti was more than happy to just execute the code? :confused:
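For future debugging, a minimal sketch (the extension call itself is left as a placeholder comment) that can help surface such errors at the offending call instead of silently later:

import os

# Must be set before the first CUDA call in the process so kernel launches
# run synchronously and report errors immediately.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

x = torch.randn(8, device="cuda")
# ... call into the custom extension here (placeholder) ...
torch.cuda.synchronize()  # raises any pending asynchronous CUDA error here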

Yes, also Tesla M60 executed without errors. :sweat_smile: