Different CUDA memory usage and speed between torch-1.7.1 + CUDA 11.0 and torch-1.8.0 + CUDA 11.1

Hi PyTorch team, I'm training a model with my own PyTorch code on 8 A100 (80 GB) GPUs. The model has 14 dilated-convolution layers and 2 graph-transformer layers, and I found that CUDA memory usage and speed differ between PyTorch versions. With torch-1.7.1 + CUDA 11.0, each device fits a batch size of 6 at 55 GB, but an epoch takes 5 hours; with torch-1.8.0 + CUDA 11.1, each device only fits a batch size of 4 at 69 GB, yet an epoch takes only 40 minutes. Why does this happen? It seems very strange.

Both PyTorch releases are old, so use the latest stable or nightly release and compare your workload against it.
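To make that comparison concrete, a minimal sketch like the one below (using a stand-in dilated-convolution stack and random data rather than your actual network and loader) records the peak CUDA memory and per-epoch time; running the same script under each PyTorch build should reproduce the difference you are describing:

```python
import time
import torch
import torch.nn as nn

# Stand-in model and data; substitute your own 14-layer dilated-conv + graph-transformer network.
model = nn.Sequential(
    *[nn.Conv1d(64, 64, 3, padding=2**i, dilation=2**i) for i in range(4)]
).cuda()
optimizer = torch.optim.Adam(model.parameters())
data = [torch.randn(4, 64, 1024, device="cuda") for _ in range(20)]

torch.cuda.reset_peak_memory_stats()      # clear the peak-memory counter for this device
start = time.perf_counter()

for x in data:                            # one "epoch" over the synthetic batches
    out = model(x)
    loss = out.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

torch.cuda.synchronize()                  # wait for all queued kernels before reading the clock
print(f"epoch time: {time.perf_counter() - start:.2f} s")
print(f"peak CUDA memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```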

However, when I use torch-1.10.1 + CUDA 11.1 and compare it with torch-1.7.1 + CUDA 11.0, the same behavior still appears. Do you have any suggestions?

Yes, profile your workload with the current stable release (2.1.2) or a nightly build.
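For the profiling step, a minimal `torch.profiler` sketch like this (again with a stand-in model and input, not your actual training step) prints the kernels sorted by total CUDA time together with their memory usage, which makes it easier to see which ops behave differently across versions:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in layer and input; replace with one forward/backward pass of your own model.
model = torch.nn.Conv1d(64, 64, 3, padding=1).cuda()
x = torch.randn(4, 64, 1024, device="cuda", requires_grad=True)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True, record_shapes=True) as prof:
    out = model(x)
    out.mean().backward()

# Sort by total CUDA time to see which kernels dominate under each PyTorch build.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```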

OK, thanks, I will give it a try.