PyTorch hanging on backward() step on certain hardware

For context, I’m a Research Software Engineer trying to help a researcher at my university diagnose a potential hardware issue they’re having with their PyTorch code. My background isn’t in AI/ML, so apologies if anything I say is wrong or unclear.

The crux of the problem is that the code hangs while computing the backward pass during training, but only when the batch size is >= 20 on the A100 GPU nodes in our HPC cluster. If we run the same code on the H100 or L40 GPU nodes we have available, batch sizes of > 20 run fine. In other words, the hang only appears on the A100 nodes above a certain batch size.

At least on the H100 and A100 nodes, the environments are practically identical: same OS, PyTorch, CUDA and NVIDIA driver versions. The main differences between those nodes are the CPUs and the amount of RAM. The L40 nodes have a slightly newer driver, but are otherwise the same.

The one workaround we’ve found is to use automatic mixed precision, which lets the code run on all the hardware we have available, including the A100s. However, we’d still like to get to the bottom of the issue and understand what’s going on. I don’t believe it’s related to running out of VRAM or anything else memory-related, because the code runs fine on the L40 GPUs, which, I believe, have lower memory bandwidth and less VRAM than the A100s and H100s.
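For reference, the workaround is roughly the standard autocast/GradScaler pattern below. The tiny model and random data are purely illustrative stand-ins, since I can’t share the researcher’s actual code:

    import torch
    from torch import nn

    # Minimal sketch of the AMP workaround; the model and data here are
    # illustrative stand-ins, not the researcher's real code.
    device = "cuda"
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1),
        nn.Flatten(),
        nn.Linear(16 * 64 * 64, 10),
    ).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()

    for _ in range(10):
        inputs = torch.randn(25, 3, 64, 64, device=device)    # batch of 25, as in our logs
        targets = torch.randint(0, 10, (25,), device=device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()   # the backward() call that hangs without AMP
        scaler.step(optimizer)
        scaler.update()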

Verbose autograd logging suggests that the hang occurs during the convolution backward step:

V0926 11:06:49.469000 2039278 torch/autograd/graph.py:803] Executing: <NllLossBackward0 object at 0x7efdca18df00> with grad_outputs: [f32[]]
V0926 11:06:49.594000 2039278 torch/autograd/graph.py:803] Executing: <LogSoftmaxBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 36]]
V0926 11:06:49.595000 2039278 torch/autograd/graph.py:803] Executing: <AddmmBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 36]]
V0926 11:06:49.638000 2039278 torch/autograd/graph.py:803] Executing: <AccumulateGrad object at 0x7efdca503df0> with grad_outputs: [f32[36]]
V0926 11:06:49.638000 2039278 torch/autograd/graph.py:803] Executing: <TBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[4000, 36]]
V0926 11:06:49.639000 2039278 torch/autograd/graph.py:803] Executing: <AccumulateGrad object at 0x7efdca503df0> with grad_outputs: [f32[36, 4000]]
V0926 11:06:49.640000 2039278 torch/autograd/graph.py:803] Executing: <LeakyReluBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 4000]]
V0926 11:06:49.641000 2039278 torch/autograd/graph.py:803] Executing: <NativeBatchNormBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 4000]]
V0926 11:06:49.642000 2039278 torch/autograd/graph.py:803] Executing: <AccumulateGrad object at 0x7efdca503df0> with grad_outputs: [f32[4000]]
V0926 11:06:49.642000 2039278 torch/autograd/graph.py:803] Executing: <AccumulateGrad object at 0x7efdca503df0> with grad_outputs: [f32[4000]]
V0926 11:06:49.643000 2039278 torch/autograd/graph.py:803] Executing: <MmBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 4000]]
V0926 11:06:49.666000 2039278 torch/autograd/graph.py:803] Executing: <TBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[8192, 4000]]
V0926 11:06:49.667000 2039278 torch/autograd/graph.py:803] Executing: <AccumulateGrad object at 0x7efdca503df0> with grad_outputs: [f32[4000, 8192]]
V0926 11:06:49.667000 2039278 torch/autograd/graph.py:803] Executing: <MeanBackward1 object at 0x7efdca503df0> with grad_outputs: [f32[25, 8192]]
V0926 11:06:49.689000 2039278 torch/autograd/graph.py:803] Executing: <UnsafeViewBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 15000, 8192]]
V0926 11:06:49.702000 2039278 torch/autograd/graph.py:803] Executing: <CloneBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 15000, 128, 64]]
V0926 11:06:49.715000 2039278 torch/autograd/graph.py:803] Executing: <TransposeBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 15000, 128, 64]]
V0926 11:06:49.727000 2039278 torch/autograd/graph.py:803] Executing: <CatBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 128, 15000, 64]]
V0926 11:06:49.744000 2039278 torch/autograd/graph.py:803] Executing: <ConvolutionBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 64, 15000, 64]]

When the script hangs, the GPUs and CPUs sit at 100% utilisation, so it is unclear to me whether it’s a true deadlock or just extremely slow kernel execution. So far, none of the following has influenced the behaviour (a condensed sketch of how we set these toggles follows the list):

  • Enabling/disabling pinned memory
  • CUDA_LAUNCH_BLOCKING=1 (occasionally allows one iteration before hanging)
  • Toggling torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.allow_tf32
  • Toggling torch.backends.cudnn.benchmark and torch.use_deterministic_algorithms
  • Disabling cuDNN leads to an OOM (so we can’t tell whether this would actually help)
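For completeness, this is roughly how those toggles were set, condensed into one snippet (in practice each was flipped between separate runs, CUDA_LAUNCH_BLOCKING was exported in the job script, and the dataset below is a random stand-in):

    import os
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Must be set before the CUDA context is created; we actually export this
    # in the job script rather than in Python.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    # TF32 and cuDNN behaviour -- each of these was flipped between runs.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    torch.backends.cudnn.benchmark = False
    # Deterministic mode may additionally need CUBLAS_WORKSPACE_CONFIG=:4096:8.
    torch.use_deterministic_algorithms(True)

    # Disabling cuDNN falls back to native kernels -- this is the configuration
    # that runs out of memory for us.
    torch.backends.cudnn.enabled = False

    # Pinned memory is toggled via the DataLoader (random stand-in dataset).
    dataset = TensorDataset(torch.randn(100, 3, 64, 64), torch.randint(0, 10, (100,)))
    loader = DataLoader(dataset, batch_size=25, pin_memory=True)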

At this point, my lack of knowledge of how all of this fits together is stopping me from getting any further. I’m unsure whether we’re hitting a cuDNN kernel issue specific to the A100s or something else entirely.

Any guidance on how to further diagnose this issue would be appreciated!

Hey @Edward-RSE,

I’m not sure if this is the real problem, but it might be a “performance” issue arising from the different GPU types. Even with identical CUDA and PyTorch versions, the difference in GPU architecture can affect the performance behaviour of the model during training.

Since convolutions are typically implemented as GEMMs, the parameters of a convolution operation can have a significant impact on performance. NVIDIA provides some insight into how the choice of parameters (e.g., batch size, height, width) affects this.
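To make the GEMM connection a bit more concrete, here is a very rough sketch, in the spirit of NVIDIA’s performance guide, of how a forward convolution maps onto a matrix multiply; the layer sizes are invented purely for illustration:

    # Very rough implicit-GEMM view of a forward convolution. All sizes below
    # are invented purely for illustration.
    N, C, H, W = 25, 64, 256, 256   # batch, input channels, input height/width
    K, R, S = 128, 3, 3             # output channels, filter height/width
    P, Q = H, W                     # output height/width (stride 1, "same" padding)

    # The three GEMM dimensions end up being roughly N*P*Q, C*R*S and K.
    gemm_dims = (N * P * Q, C * R * S, K)

    # Tensor Core tiling is most efficient when these are multiples of large
    # powers of two, which is why the guide recommends batch sizes and channel
    # counts divisible by 64 (and ideally 256).
    print(gemm_dims)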

One such performance issue is related to inefficient tiling, for which they give several recommendations:

Choose parameters (batch size, number of input and output channels) to be divisible by at least 64 and ideally 256 to enable efficient tiling and reduce overhead

Larger values for size-related parameters (batch size, input and output height and width, and the number of input and output channels) can improve parallelization. As with fully-connected layers, this speeds up an operation’s efficiency, but does not reduce its absolute duration; see How Convolution Parameters Affect Performance and subsections.

Moreover, here you can see guidelines for choosing parameters depending on the number of Tensor Cores (including explicit values for the A100).

So you should double-check with your team whether the batch size and the parameters of the convolutional layers in the model follow the cited guidelines.
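As a quick check, something like the sketch below could flag convolution layers whose channel counts don’t match those recommendations (check_conv_alignment and the threshold of 64 are my own choices, not an official tool):

    import torch.nn as nn

    def check_conv_alignment(model: nn.Module, multiple: int = 64) -> None:
        """Print Conv2d layers whose channel counts are not divisible by `multiple`."""
        for name, module in model.named_modules():
            if isinstance(module, nn.Conv2d):
                if module.in_channels % multiple or module.out_channels % multiple:
                    print(f"{name}: in_channels={module.in_channels}, "
                          f"out_channels={module.out_channels} "
                          f"(not divisible by {multiple})")

    # Usage: check_conv_alignment(model)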

I hope this helps.