For context, I’m a Research Software Engineer trying to help a researcher at my university diagnose a potential hardware issue they’re having with their PyTorch code. My background isn’t in AI/ML, so apologies if anything I say is wrong or unclear.
The crux of the problem is that the code hangs in the backward pass during training, but only when the batch size is >= 20 on the A100 GPU nodes in our HPC cluster. If we run the same code on the H100 or L40 GPU nodes we have available, batch sizes above 20 run fine; the hang only occurs on the A100 nodes above that batch-size threshold.
At least on the H100 and A100 nodes, the environments are practically identical: same OS, PyTorch, CUDA and NVIDIA driver versions. The main difference between the nodes is the CPUs and the amount of RAM. The L40 nodes have a slightly newer driver, but are otherwise the same.
The one workaround we’ve found is to use automatic mixed precision on the A100 GPUs; with AMP enabled, the code runs on all the hardware we have available. However, we’d still like to get to the bottom of the issue and understand what’s going on. I don’t believe it’s related to running out of VRAM or anything else memory-related, because we know the code works on the L40 GPUs, which, I believe, have lower memory bandwidth and less VRAM than the A100s and H100s.
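For reference, the AMP workaround is essentially a standard autocast + GradScaler training loop. This is a minimal sketch only, with a toy model, optimiser and dummy data standing in for the researcher’s actual code:

```python
import torch
import torch.nn as nn

# purely illustrative stand-ins for the researcher's model and data
model = nn.Sequential(nn.Linear(8192, 4000), nn.LeakyReLU(), nn.Linear(4000, 36)).cuda()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(25, 8192, device="cuda")
targets = torch.randint(0, 36, (25,), device="cuda")

scaler = torch.cuda.amp.GradScaler()
for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():          # eligible ops run in reduced precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()            # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```

With the training step wrapped like this, the batch sizes that hang in full fp32 on the A100s run through without issue.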
Verbose logging shows that the hang is probably occurring during the convolution backward step:
V0926 11:06:49.469000 2039278 torch/autograd/graph.py:803] Executing: <NllLossBackward0 object at 0x7efdca18df00> with grad_outputs: [f32[]]
V0926 11:06:49.594000 2039278 torch/autograd/graph.py:803] Executing: <LogSoftmaxBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 36]]
V0926 11:06:49.595000 2039278 torch/autograd/graph.py:803] Executing: <AddmmBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 36]]
V0926 11:06:49.638000 2039278 torch/autograd/graph.py:803] Executing: <AccumulateGrad object at 0x7efdca503df0> with grad_outputs: [f32[36]]
V0926 11:06:49.638000 2039278 torch/autograd/graph.py:803] Executing: <TBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[4000, 36]]
V0926 11:06:49.639000 2039278 torch/autograd/graph.py:803] Executing: <AccumulateGrad object at 0x7efdca503df0> with grad_outputs: [f32[36, 4000]]
V0926 11:06:49.640000 2039278 torch/autograd/graph.py:803] Executing: <LeakyReluBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 4000]]
V0926 11:06:49.641000 2039278 torch/autograd/graph.py:803] Executing: <NativeBatchNormBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 4000]]
V0926 11:06:49.642000 2039278 torch/autograd/graph.py:803] Executing: <AccumulateGrad object at 0x7efdca503df0> with grad_outputs: [f32[4000]]
V0926 11:06:49.642000 2039278 torch/autograd/graph.py:803] Executing: <AccumulateGrad object at 0x7efdca503df0> with grad_outputs: [f32[4000]]
V0926 11:06:49.643000 2039278 torch/autograd/graph.py:803] Executing: <MmBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 4000]]
V0926 11:06:49.666000 2039278 torch/autograd/graph.py:803] Executing: <TBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[8192, 4000]]
V0926 11:06:49.667000 2039278 torch/autograd/graph.py:803] Executing: <AccumulateGrad object at 0x7efdca503df0> with grad_outputs: [f32[4000, 8192]]
V0926 11:06:49.667000 2039278 torch/autograd/graph.py:803] Executing: <MeanBackward1 object at 0x7efdca503df0> with grad_outputs: [f32[25, 8192]]
V0926 11:06:49.689000 2039278 torch/autograd/graph.py:803] Executing: <UnsafeViewBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 15000, 8192]]
V0926 11:06:49.702000 2039278 torch/autograd/graph.py:803] Executing: <CloneBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 15000, 128, 64]]
V0926 11:06:49.715000 2039278 torch/autograd/graph.py:803] Executing: <TransposeBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 15000, 128, 64]]
V0926 11:06:49.727000 2039278 torch/autograd/graph.py:803] Executing: <CatBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 128, 15000, 64]]
V0926 11:06:49.744000 2039278 torch/autograd/graph.py:803] Executing: <ConvolutionBackward0 object at 0x7efdca503df0> with grad_outputs: [f32[25, 64, 15000, 64]]
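For what it’s worth, here is my attempt at reconstructing the forward pass that would produce this backward graph, purely from the shapes in the log above. The layer sizes match the log, but the conv’s in_channels, kernel size and padding are guesses, and the second tensor in the cat is hypothetical, so treat this as illustrative rather than the researcher’s actual model (the spatial dims are large, so scale H down if you want to run it locally):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, H, W = 25, 15000, 64                      # batch and spatial sizes taken from the log
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)   # in_channels / kernel / padding are guesses
fc1 = nn.Linear(128 * 64, 4000, bias=False)
bn = nn.BatchNorm1d(4000)
fc2 = nn.Linear(4000, 36)

x = torch.randn(B, 64, H, W)                 # hypothetical conv input
skip = torch.randn(B, 64, H, W)              # hypothetical second branch for the cat
target = torch.randint(0, 36, (B,))

out = conv(x)                                # ConvolutionBackward0 -> [B, 64, H, W]
out = torch.cat([out, skip], dim=1)          # CatBackward0         -> [B, 128, H, W]
out = out.transpose(1, 2).contiguous()       # TransposeBackward0 + CloneBackward0
out = out.view(B, H, 128 * 64)               # UnsafeViewBackward0  -> [B, H, 8192]
out = out.mean(dim=1)                        # MeanBackward1        -> [B, 8192]
out = F.leaky_relu(bn(fc1(out)))             # MmBackward0, NativeBatchNormBackward0, LeakyReluBackward0
out = fc2(out)                               # AddmmBackward0       -> [B, 36]
loss = F.nll_loss(F.log_softmax(out, dim=1), target)  # LogSoftmaxBackward0, NllLossBackward0
loss.backward()                              # the log stops inside ConvolutionBackward0
```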
When the script hangs, the GPUs and CPUs sit at 100% utilisation, so it is unclear to me whether it’s a true deadlock or just extremely slow kernel execution. So far, none of the following seems to influence the behaviour (shown roughly as code in the snippet after this list):
- Enabling/disabling pinned memory
- Setting CUDA_LAUNCH_BLOCKING=1 (occasionally allows one iteration before hanging)
- Toggling torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.allow_tf32
- Toggling torch.backends.cudnn.benchmark and torch.use_deterministic_algorithms
- Disabling cuDNN entirely, which leads to an OOM (so we can’t tell whether this would actually help)
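For completeness, this is roughly how those settings were toggled; they were tried in various combinations, not all at once as shown here:

```python
import os
import torch

# CUDA_LAUNCH_BLOCKING was exported in the job script before Python started;
# shown here as an env var purely for completeness
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

torch.backends.cuda.matmul.allow_tf32 = False    # toggled True/False
torch.backends.cudnn.allow_tf32 = False          # toggled True/False
torch.backends.cudnn.benchmark = True            # toggled True/False
torch.use_deterministic_algorithms(True)         # toggled True/False
# torch.backends.cudnn.enabled = False           # disabling cuDNN just OOMs for us

# pinned memory is toggled on the DataLoader, e.g.:
# loader = torch.utils.data.DataLoader(dataset, batch_size=25, pin_memory=True)
```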
At this point, my lack of knowledge of how all of this fits together is stopping me from progressing any further. I’m unsure whether we’re hitting a cuDNN kernel issue specific to the A100s or something else entirely.
Any guidance on how to further diagnose this issue would be appreciated!