I have written some code implementing an architecture idea (composed of convolutional blocks, Transformer blocks, and MLP blocks). The code trains perfectly well (i.e., a smooth loss curve) both on my laptop (equipped with an NVIDIA GeForce RTX 3080 Laptop GPU) and on a VM I have access to (equipped with a GRID V100D-16Q); let's call this VM1. However, when I copied this exact same code to a new VM I got access to, equipped with A100 GPUs (let's call this one VM2), and ran my training there, the training wasn't successful. I made sure the same code was running with the exact same training data, the same hyper-parameters, still training on a single GPU, etc. More specifically, the loss started to decrease initially, but after a couple of epochs the decrease became so slow that the loss was almost constant.
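To give a rough idea of what I mean by that composition, here is a minimal sketch of the block structure (the layer sizes, ordering, and names are placeholders for illustration, not my actual code):

import torch
import torch.nn as nn

class ConvTransformerMLP(nn.Module):
    def __init__(self, in_ch=3, dim=256, n_heads=8, n_classes=1000):
        super().__init__()
        # Convolutional stem: extracts a grid of local features.
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=7, stride=4, padding=3),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # Transformer block: global attention over the flattened feature grid.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True
        )
        # MLP head: pooled features -> logits.
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, n_classes),
        )

    def forward(self, x):
        x = self.conv(x)                  # (B, dim, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)  # (B, H*W/16, dim)
        x = self.encoder(x)
        return self.mlp(x.mean(dim=1))    # global average pool -> logits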
Initially, I thought it was a problem of badly chosen hyper-parameters, but after trying many configurations I scrapped that possibility. My next intuition was that the different PyTorch versions were the culprit, as I was using PyTorch 1.10 locally and on VM1, while only PyTorch 1.11 is supported on the new VM2. So I installed PyTorch 1.11 on VM1, and the training proceeded normally there; the PyTorch version doesn't seem to be responsible either. I also ran several other tests, such as setting the following two lines of code in the hope of reducing nondeterminism, but still to no avail:
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
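In case it matters, here is a sketch of the fuller determinism setup along these lines (the seed value is a placeholder, and the calls beyond the two flags above are additional knobs I could also try):

import os
import random
import numpy as np
import torch

def seed_everything(seed: int = 42):
    # Seed every RNG that could affect training (seed value is a placeholder).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Disable cuDNN autotuning and force deterministic convolution kernels.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Make PyTorch raise an error on nondeterministic ops (available since 1.8).
    torch.use_deterministic_algorithms(True)
    # Required by cuBLAS on CUDA >= 10.2 for deterministic matmuls;
    # must be set before the first CUDA call.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"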
For your reference, here is the epoch-by-epoch loss over the first ten epochs on my local machine (similar to VM1) versus the new A100-equipped VM2:
Local Machine (NVIDIA GeForce RTX 3080 Laptop GPU) and current VM1 (GRID V100D-16Q):
12.15, 10.00, 9.18, 8.42, 7.82, 7.38, 7.03, 6.77, 6.56, 6.38
New VM2 (A100 GPU):
14.62, 12.88, 12.51, 12.17, 11.96, 11.76, 11.64, 11.52, 11.47, 11.44
Even after 100 epochs, the loss on this VM remains around 8 (and my evaluation metrics stay close to 0).
So notice that I'm not talking about simple noise or a small difference in performance; I'm talking about going from perfectly smooth training to essentially no learning at all, simply by changing machines.
Would someone have any idea why this might be happening? Could it be the different CUDA versions? Or could it even be related to the difference in CPUs, since the A100 VM is set up with AMD CPUs rather than Intel ones? Any help or direction would be greatly appreciated.
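In case it helps with diagnosing this, here is a quick environment-comparison snippet I can run on both machines (nothing in it is specific to my code; the TF32 flags at the end are just Ampere-related settings I'm aware of):

import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))
print("Compute capability:", torch.cuda.get_device_capability(0))
# On Ampere GPUs (compute capability 8.x, e.g. the A100) these default to
# True in PyTorch 1.11, so matmuls/convolutions may silently run in TF32:
print("TF32 matmul:", torch.backends.cuda.matmul.allow_tf32)
print("TF32 cuDNN:", torch.backends.cudnn.allow_tf32)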