Different training behavior on different machines


I have written some code to implement an architecture idea (composed of convolutional blocks, Transformer blocks, and MLP blocks). The code trains perfectly well (i.e., with a smooth loss curve) both on my portable machine (equipped with an NVIDIA GeForce RTX 3080 Laptop GPU) and on a VM I have access to (equipped with a GRID V100D-16Q); let’s call this VM1. However, when I copied this exact same code to a new VM I got access to, equipped with A100 GPUs (let’s call this one VM2), and ran my training code there, the training wasn’t successful. I made sure that the same code was running with the exact same training data and the same hyper-parameters, still training on a single GPU, etc. More specifically, the loss started to decrease initially, but after a couple of epochs the decrease became so slow that the loss was almost constant.

Initially, I thought the hyper-parameters were badly set, but after trying many configurations I ruled out this possibility. My next guess was that the different PyTorch versions were the culprit, as I was using PyTorch 1.10 locally and on my current VM1, while only PyTorch 1.11 is supported on the new VM2. So I installed PyTorch 1.11 on VM1, and the training proceeded normally there, so the PyTorch version doesn’t seem to be responsible either. I also tried several other tests, such as adding the following two lines of code in the hope of reducing nondeterminism, but to no avail:

torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True, warn_only=True)

And for your reference, here are the differences in the progress of the loss between my local machine (similar to VM1) and the new A100-equipped VM2:

Local Machine (NVIDIA GeForce RTX 3080 Laptop GPU) and current VM1 (GRID V100D-16Q):
12.15, 10.00, 9.18, 8.42, 7.82, 7.38, 7.03, 6.77, 6.56, 6.38
New VM2 (A100 GPU):
14.62, 12.88, 12.51, 12.17, 11.96, 11.76, 11.64, 11.52, 11.47, 11.44
Even after 100 epochs, the loss remains around 8 on this VM (with metrics close to 0 in my case).

Note that I’m not talking about some simple noise or a small difference in performance; I’m talking about going from perfectly smooth training to almost no learning at all simply by changing machines.

Would someone have any idea why this might be happening? Could it be the different versions of CUDA? Or could it even be related to the difference in CPUs, as the A100 GPUs are set up with AMD CPUs rather than Intel CPUs? Any help or direction would be greatly appreciated.

Hi Joseph!

This could well be due to the TensorFloat-32 reduced-precision floating-point
format supported on Ampere GPUs. Prior to pytorch version 1.12, this was
silently enabled by default. For version 1.12 it is (silently) enabled by default
for convolutions, but not for explicit matmul()s.

This reduced precision is not supposed to screw up training (and does
speed things up), but, depending on your use case, it possibly could.
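To give a feel for how much precision is at stake, here is a rough pure-Python sketch. It assumes the TF32 layout of 8 exponent bits and 10 explicit mantissa bits, and it simply truncates the extra float32 mantissa bits, which is a simplification of what the hardware does. The helpers `to_tf32` and `dot` are made up for this illustration and are not part of pytorch.

```python
import struct


def to_tf32(x: float) -> float:
    """Truncate a float32 value to TF32's 10 explicit mantissa bits (simplified)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # zero the 13 low mantissa bits that TF32 does not keep
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(xs, ys, convert=lambda v: v):
    # Inputs are optionally reduced in precision before multiplying; the sum is
    # kept in Python's double precision purely for illustration.
    return sum(convert(x) * convert(y) for x, y in zip(xs, ys))


xs = [1.0 + i * 1e-4 for i in range(1000)]
ys = [1.0 - i * 1e-4 for i in range(1000)]

exact = dot(xs, ys)
reduced = dot(xs, ys, to_tf32)
rel_err = abs(reduced - exact) / abs(exact)
print("relative error with TF32-like inputs:", rel_err)
```

Errors of this size are usually harmless, but a model that is sensitive to small differences in its matmuls or convolutions could plausibly react to them.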

Try setting torch.backends.cuda.matmul.allow_tf32 and / or
torch.backends.cudnn.allow_tf32 to False and see if that fixes
training on your A100s.
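Something like the following, placed near the top of the training script before the model is built, should do it. This is only a sketch; the print statements are there just to confirm that the flags took effect.

```python
import torch

# Disable TF32 for matrix multiplications (cuBLAS) and for cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Read the flags back to confirm the settings.
print("matmul TF32 allowed:", torch.backends.cuda.matmul.allow_tf32)
print("cuDNN TF32 allowed:", torch.backends.cudnn.allow_tf32)
```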


K. Frank

Hello @KFrank !

First off, thank you for taking the time to comment on my issue!

That’s an interesting hypothesis, and I tried setting torch.backends.cudnn.allow_tf32 to False as suggested. Unfortunately, this doesn’t seem to affect the problematic training behavior at all.


Joseph Assaker.

Hi Joseph!

Did you also try setting torch.backends.cuda.matmul.allow_tf32 to False?


K. Frank

Hello @KFrank ,

I just tried setting torch.backends.cuda.matmul.allow_tf32 to False as well, but unfortunately that didn’t affect the training either.

Thanks again for your help!


Joseph Assaker.

Hi Joseph!

Okay, it sounds like there could be some problem here.

Just to confirm, your VM1 and VM2 are running the same version of pytorch
and get non-trivially different results running the same script, even with both
allow_tf32 flags set to False. Correct?

Narrow things down to the simplest computation that differs on the two systems
by more than floating-point round-off error. This could be a single forward pass
or even just an example matmul(). (Make sure that you initialize any random
number seeds to the same well-defined values on both systems.)
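As a starting point, a single-matmul check might look like the sketch below. The matrix size and the seed are arbitrary choices, and the inputs are generated on the CPU so that they should be the same on both systems for a given pytorch version. The float64 CPU product serves as the reference.

```python
import torch

torch.manual_seed(0)  # use the same seed on every machine
device = "cuda" if torch.cuda.is_available() else "cpu"

# Create the inputs on the CPU so they do not depend on the GPU model.
a = torch.randn(512, 512)
b = torch.randn(512, 512)

# Double-precision CPU result, used as the reference.
ref = a.double() @ b.double()

# The same product in float32 on the device under test.
out = (a.to(device) @ b.to(device)).double().cpu()

max_abs_err = (out - ref).abs().max().item()
print(f"device={device}  max abs error vs float64: {max_abs_err:.3e}")
```

If the error printed on one system is much larger than on the other, the matmul path is a likely suspect. If they agree, move on to a single convolution, then a single forward pass of the model, and so on.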

The best would be if you could create a small, fully-self-contained, runnable
script that reproduces the issue, together with its output from both VM1 and
VM2.

Also, post the results of python -m torch.utils.collect_env for both of
the systems to see if the experts might spot something fishy.


K. Frank

That’s exactly what I’ll try to do next.

Here is the output on my portable machine (code works fine):

Collecting environment information...
PyTorch version: 1.10.2+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19041-SP0
Is CUDA available: True
CUDA runtime version: 11.6.112
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Laptop GPU
Nvidia driver version: 512.36
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==1.10.2+cu113
[pip3] torchaudio==0.10.2+cu113
[pip3] torchvision==0.11.3+cu113
[conda] Could not collect

Here is the output produced on VM1 (code works fine):

Collecting environment information...
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Mar 15 2022, 12:22:08)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-110-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GRID V100D-16Q
Nvidia driver version: 470.82.01
cuDNN version: Probably one of the following:
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] torch==1.11.0
[pip3] torchvision==0.12.0
[conda] Could not collect

And finally, here is the output produced on VM2 (where the code doesn’t work as expected):

Collecting environment information...
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.2
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.4 (Ootpa) (x86_64)
GCC version: (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1)
Clang version: Could not collect
CMake version: version 3.23.1
Libc version: glibc-2.28

Python version: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-4.18.0-305.57.1.el8_4.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 510.73.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.2
[pip3] nvidia-dlprof-pytorch-nvtx==1.8.0
[pip3] pytorch-fast-transformers==0.4.0
[pip3] pytorch-ignite==0.4.8
[pip3] pytorch-lightning==1.6.1
[pip3] pytorch-msssim==0.2.1
[pip3] pytorch-pfn-extras==0.5.8
[pip3] pytorch3d==0.7.0
[pip3] pytorchvideo==0.1.5
[pip3] segmentation-models-pytorch==0.2.1
[pip3] torch==1.11.0
[pip3] torch-cluster==1.6.0
[pip3] torch-geometric==2.0.4
[pip3] torch-points-kernels==0.6.10
[pip3] torch-scatter==2.0.9
[pip3] torch-sparse==0.6.13
[pip3] torch-spline-conv==1.2.1
[pip3] torch-tb-profiler==0.4.0
[pip3] torchaudio==0.11.0+820b383
[pip3] torchio==0.18.76
[pip3] torchmetrics==0.8.0
[pip3] torchsparse==1.4.0
[pip3] torchtext==0.12.0a0+d7a34d6
[pip3] torchvision==0.12.0a0
[pip3] torchviz==0.0.2
[conda] efficientnet-pytorch      0.6.3                    pypi_0    pypi
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.21.2           py39hdbf815f_0    conda-forge
[conda] nvidia-dlprof-pytorch-nvtx 1.8.0                    pypi_0    pypi
[conda] pytorch-fast-transformers 0.4.0                    pypi_0    pypi
[conda] pytorch-ignite            0.4.8                    pypi_0    pypi
[conda] pytorch-lightning         1.6.1                    pypi_0    pypi
[conda] pytorch-msssim            0.2.1                    pypi_0    pypi
[conda] pytorch-pfn-extras        0.5.8                    pypi_0    pypi
[conda] pytorch3d                 0.7.0                    pypi_0    pypi
[conda] pytorchvideo              0.1.5                    pypi_0    pypi
[conda] segmentation-models-pytorch 0.2.1                    pypi_0    pypi
[conda] torch                     1.11.0                   pypi_0    pypi
[conda] torch-cluster             1.6.0                    pypi_0    pypi
[conda] torch-geometric           2.0.4                    pypi_0    pypi
[conda] torch-points-kernels      0.6.10                   pypi_0    pypi
[conda] torch-scatter             2.0.9                    pypi_0    pypi
[conda] torch-sparse              0.6.13                   pypi_0    pypi
[conda] torch-spline-conv         1.2.1                    pypi_0    pypi
[conda] torch-tb-profiler         0.4.0                    pypi_0    pypi
[conda] torchaudio                0.11.0+820b383           pypi_0    pypi
[conda] torchio                   0.18.76                  pypi_0    pypi
[conda] torchmetrics              0.8.0                    pypi_0    pypi
[conda] torchsparse               1.4.0                    pypi_0    pypi
[conda] torchtext                 0.12.0a0+d7a34d6          pypi_0    pypi
[conda] torchvision               0.12.0a0                 pypi_0    pypi
[conda] torchviz                  0.0.2                    pypi_0    pypi

If anyone can spot something worth trying in this output, please let me know!

Joseph Assaker.

I am experiencing the same on a very similar setup. In my case:

VM1 = RTX 3080, with pytorch=1.9, cuda=11.0 (Windows)
VM2 = A100, with pytorch=1.13, cuda=11.8 (Linux)

The same code, data, and hyperparameters produce different training curves, with VM2 being the worse of the two.

I checked the weight initialization and optimizer defaults across the two pytorch versions and found that they are the same.

Is it the OS? The CUDA versions?

This issue might be more critical than it looks, since many people work cross-platform when scaling up their projects.

It could be anything since you are using entirely different setups, so try to align at least some of the libraries to narrow down the issue further.
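To see what actually differs, it can help to print the same few facts on each machine and compare them side by side. A small sketch (the `env` dictionary is just a convenient container for this example):

```python
import torch

# Collect the pieces of the setup that most often differ between machines.
env = {
    "torch": torch.__version__,
    "cuda_build": torch.version.cuda,          # None on CPU-only builds
    "cudnn": torch.backends.cudnn.version(),   # None if cuDNN is unavailable
    "cuda_available": torch.cuda.is_available(),
}
if env["cuda_available"]:
    env["gpu"] = torch.cuda.get_device_name(0)
    env["compute_capability"] = torch.cuda.get_device_capability(0)

for key, value in env.items():
    print(f"{key}: {value}")
```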