I am experiencing different inference results across different GPU architectures, e.g. a 2080 Ti and a 3060, which have compute capabilities 7.5 and 8.6, respectively. I am using torch 1.13.1+cu116. The model also relies on some external code, such as detectron2.
What could be the reason for this? Can the underlying math operations differ across compute architectures, and if so, can I somehow minimize the differences? Can Docker cause discrepancies and/or non-determinism between different builds? Can the same built Docker image, used on different machines, cause non-determinism?
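For the run-to-run side of the question, PyTorch exposes a few knobs that constrain algorithm selection. This is only a minimal sketch of those settings; it reduces non-determinism within one machine, but it cannot make results bit-identical across different GPU architectures, since the generated kernels themselves differ per architecture:

```python
import torch

# Ask PyTorch to prefer deterministic algorithms where they exist.
# warn_only=True warns (instead of raising) for ops with no deterministic path.
torch.use_deterministic_algorithms(True, warn_only=True)

# Disable cuDNN autotuning, which can pick different kernels per run/GPU.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Fix RNG seeds for CPU and all visible GPUs.
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
```

Some CUDA ops additionally require the `CUBLAS_WORKSPACE_CONFIG=:4096:8` environment variable to be set before launch for full determinism; see the PyTorch reproducibility notes for the details.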
Your 2080 does not support TF32, so you won’t be able to enable/disable it.
However, could you explain what “expected” behavior means in this context?
Do you see the same results on both devices if you disable TF32 on the 3090 and if so, are these results “worse” in your metric?
Thank you for the tips. From visualizing a couple of samples, the model on the 2080 seemed to produce far fewer predictions, though sometimes many overlapping ones, whereas on the 3060 the model was accurate and produced more sensible results. I’ll do some more quantitative comparisons of the predictions against annotated data to see how much they differ in practice, and try some remedies accordingly.
Update: I have been experimenting with an RTX 3090 and an RTX 2080 Ti. The model gives quantitatively consistent predictions on each GPU, but the predictions differ a lot across the two. The RTX 3090 demonstrates the expected and desired behaviour, but on the RTX 2080 Ti there are usually dozens more predictions, in some cases even hundreds more. Will continue to investigate. We are seeing that compute capability >= 8.0 works well.
When the custom CUDA/C++ extension is built, the code is generated targeting the underlying architecture (e.g. -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75). So the build from the 3090 did not work as expected on the 2080 Ti. After rebuilding on the 2080 Ti, everything works.
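A hedged sketch of how to sanity-check this on a given machine (the `TORCH_CUDA_ARCH_LIST` environment variable shown in the comment is read by `torch.utils.cpp_extension` at build time, assuming the extension is built through it):

```python
import torch

# Print the compute capability of the visible GPU so you can confirm the
# extension was compiled for it (guarded so this also runs on CPU-only hosts).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"compute capability: {major}.{minor}")

# To cover both GPUs with one build, set TORCH_CUDA_ARCH_LIST before compiling,
# e.g.  TORCH_CUDA_ARCH_LIST="7.5;8.6" python setup.py install
# which emits cubins/PTX for both sm_75 and sm_86 in a single binary.
```

Building one "fat" binary this way avoids having to rebuild the extension per machine.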
Oh, that’s interesting as I didn’t realize you are using custom CUDA code.
I would also expect to see kernel launch errors if the binary isn’t compatible, but it seems your 2080 Ti was more than happy to just execute the code?