Forward Pass on Conv2d Segfaults

dbalaban · January 30, 2024, 9:54pm

On Ubuntu 20.04 with Python 3.10.6, I wrote the following test script to reproduce my error:

import torch
import torch.nn as nn
from torch.optim import SGD

import faulthandler
faulthandler.enable()

print(f"torch version: {torch.__version__}")

x = torch.ones([1,3,10,10]).to("cuda:0")
conv1 = nn.Conv2d(3,4,kernel_size=2,bias=False).to("cuda:0")

opt = SGD(conv1.parameters(),lr=1e-3)

y = conv1.forward(x)
loss = y.sum()

opt.zero_grad()
loss.backward()
opt.step()

This gives me the following output:

torch version: 2.2.0+cu118
Fatal Python error: Segmentation fault

Current thread 0x00007fea02085740 (most recent call first):
  File "/scratch/cluster/dbalaban/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456 in _conv_forward
  File "/scratch/cluster/dbalaban/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460 in forward
  File "/scratch/cluster/dbalaban/SemanticLabelPropagation/test.py", line 12 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
Segmentation fault (core dumped)

I installed PyTorch with the following command:

python3 -m pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

My nvcc reports the following version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

NVIDIA GPU information:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:17:00.0 Off |                    0 |
|  0%   42C    P0    82W / 300W |  10643MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:65:00.0 Off |                    0 |
|  0%   43C    P0    85W / 300W |  10641MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:CA:00.0 Off |                    0 |
|  0%   39C    P0    77W / 300W |  10131MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:E3:00.0 Off |                    0 |
|  0%   43C    P0    79W / 300W |  10129MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1451786      C   ray::execute_config             10638MiB |
|    1   N/A  N/A   1457248      C   ray::execute_config             10636MiB |
|    2   N/A  N/A   1439609      C   ray::execute_config             10124MiB |
|    3   N/A  N/A   1441881      C   ray::execute_config             10124MiB |
+-----------------------------------------------------------------------------+

After that segfault, I tried reinstalling with a different CUDA version:

python3 -m pip uninstall torch torchvision
python3 -m pip freeze | grep nvidia- | xargs python3 -m pip uninstall -y
python3 -m pip install torch torchvision

Running the test script again then produced a series of “libcudnn_cnn_train.so.8: undefined symbol” errors during the backward pass. This seems to come from using CUDA 12 instead of CUDA 11.8. I cannot update my CUDA install because I do not have sudo permissions.

What can I do to fix my Python environment so that training works?

Thanks

ptrblck · January 30, 2024, 9:56pm

It seems your locally installed CUDA toolkit (including cuDNN) might be conflicting with the binaries.
Could you remove cuDNN (and other CUDA libs) from the LD_LIBRARY_PATH allowing PyTorch to use its own CUDA dependencies?
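As a rough illustration of that cleanup, here is a minimal Python sketch that drops CUDA-related directories from an LD_LIBRARY_PATH value. The helper name `strip_cuda_entries` and the substring markers ("cuda", "cudnn") are assumptions made for this example, so check the printed result against your own paths before using it. The dynamic loader reads the variable when a process starts, so the cleaned value has to be exported in the shell before launching Python; changing `os.environ` inside an already running script would not affect libraries the loader resolves.

```python
import os
import shlex


def strip_cuda_entries(ld_library_path, markers=("cuda", "cudnn")):
    """Return the path list without entries that look CUDA-related.

    `markers` is a hypothetical heuristic: any directory whose name
    contains one of these substrings (case-insensitive) is dropped.
    """
    entries = [e for e in ld_library_path.split(":") if e]
    kept = [e for e in entries if not any(m in e.lower() for m in markers)]
    return ":".join(kept)


if __name__ == "__main__":
    cleaned = strip_cuda_entries(os.environ.get("LD_LIBRARY_PATH", ""))
    # Print a line that can be pasted into the shell before starting Python.
    print(f"export LD_LIBRARY_PATH={shlex.quote(cleaned)}")
```

Running `unset LD_LIBRARY_PATH` in the shell is a simpler alternative when nothing else on the path is needed.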

dbalaban · January 30, 2024, 10:07pm

Thank you, removing my cuda-related adjustments to the LD_LIBRARY_PATH environment variable did the trick. I’m curious how you were able to identify the conflict?

ptrblck · January 31, 2024, 2:01am

I’ve seen this kind of issue before and the missing symbol gave the right hint.
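One way to check for this kind of conflict yourself is to see which cuDNN shared objects the process actually mapped. The sketch below is an illustration, not an official PyTorch tool: the helper name `mapped_libraries` is made up for this example, and it relies on the Linux `/proc/self/maps` listing. If the printed paths point at a system CUDA directory rather than the Python environment's site-packages, the local install is probably shadowing the libraries that came with the pip wheel.

```python
import os


def mapped_libraries(maps_text, needle="libcudnn"):
    """Return the sorted, de-duplicated file paths in a /proc/<pid>/maps
    listing whose path contains `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, optional pathname
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


if __name__ == "__main__":
    # For a real check, import torch and run one convolution first so that
    # cuDNN has been loaded into this process before reading the maps.
    if os.path.exists("/proc/self/maps"):
        with open("/proc/self/maps") as f:
            for path in mapped_libraries(f.read()):
                print(path)
```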

Brando_Miranda · August 8, 2024, 12:39am

Related: How to fix Segmentation fault when training GPT-2 model using Hugging Face Transformers? (Stack Overflow, machine learning)