Forward Pass on Conv2d Segfaults

Running on Ubuntu 20.04 with Python 3.10.6, I wrote the following test script to reproduce the error:

import torch
import torch.nn as nn
from torch.optim import SGD

import faulthandler
faulthandler.enable()  # dump the Python stack if the interpreter crashes

print(f"torch version: {torch.__version__}")

x = torch.ones([1, 3, 10, 10]).to("cuda:0")
conv1 = nn.Conv2d(3, 4, kernel_size=2, bias=False).to("cuda:0")

opt = SGD(conv1.parameters(), lr=1e-3)

y = conv1.forward(x)  # this forward pass is where the segfault occurs
loss = y.sum()

opt.zero_grad()
loss.backward()
opt.step()

which gives me the following output:

torch version: 2.2.0+cu118
Fatal Python error: Segmentation fault

Current thread 0x00007fea02085740 (most recent call first):
  File "/scratch/cluster/dbalaban/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456 in _conv_forward
  File "/scratch/cluster/dbalaban/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460 in forward
  File "/scratch/cluster/dbalaban/SemanticLabelPropagation/test.py", line 12 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
Segmentation fault (core dumped)

I installed PyTorch with the following command:

python3 -m pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
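
As a sanity check that the cu118 wheel is really the one being imported, the versions PyTorch itself reports can be printed directly (just a quick sketch, run inside the same venv):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version(), torch.cuda.is_available())"

In my case torch.__version__ already comes back as 2.2.0+cu118 (as in the output above), so the wheel itself matches the cu118 index.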

nvcc reports the following version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

NVIDIA GPU information (from nvidia-smi):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:17:00.0 Off |                    0 |
|  0%   42C    P0    82W / 300W |  10643MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:65:00.0 Off |                    0 |
|  0%   43C    P0    85W / 300W |  10641MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:CA:00.0 Off |                    0 |
|  0%   39C    P0    77W / 300W |  10131MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:E3:00.0 Off |                    0 |
|  0%   43C    P0    79W / 300W |  10129MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1451786      C   ray::execute_config             10638MiB |
|    1   N/A  N/A   1457248      C   ray::execute_config             10636MiB |
|    2   N/A  N/A   1439609      C   ray::execute_config             10124MiB |
|    3   N/A  N/A   1441881      C   ray::execute_config             10124MiB |
+-----------------------------------------------------------------------------+

After that segfault, I tried reinstalling with a different CUDA version like so:

python3 -m pip uninstall torch torchvision
python3 -m pip freeze | grep nvidia- | xargs pip uninstall -y
python3 -m pip install torch torchvision

and running the test script again produced a bunch of “libcudnn_cnn_train.so.8: undefined symbol” errors during the backward pass. This seems to come from the default wheels targeting CUDA 12 instead of CUDA 11.8. I cannot update the system CUDA install, as I do not have sudo permissions.
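
For reference, here is a rough sketch of how one could check whether two different cuDNN copies are visible at once (the paths are guesses and depend on the wheel; the plain PyPI wheels ship cuDNN via separate nvidia-* pip packages under site-packages/nvidia/cudnn/lib rather than under torch/lib):

# directories a local CUDA/cuDNN install has put on the loader's search path
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i -E 'cuda|cudnn'
# cuDNN libraries bundled inside the pip-installed torch wheel
ls "$(python3 -c 'import torch, os; print(os.path.dirname(torch.__file__))')/lib" | grep -i cudnn

If both commands list a libcudnn, the copy on LD_LIBRARY_PATH can end up shadowing the one the wheel bundles.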

What can I do to fix my Python environment so it trains properly?

Thanks

It seems your locally installed CUDA toolkit (including cuDNN) might be conflicting with the libraries bundled in the PyTorch binaries.
Could you remove cuDNN (and the other CUDA libs) from LD_LIBRARY_PATH, allowing PyTorch to use its own CUDA dependencies?
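
Something along these lines should work as a quick test (just a sketch; adjust the grep filter to match whatever entries your LD_LIBRARY_PATH actually contains):

# show what is currently on the search path, one entry per line
echo "$LD_LIBRARY_PATH" | tr ':' '\n'
# drop the CUDA/cuDNN-related entries for this shell session only, then rerun the repro
export LD_LIBRARY_PATH=$(echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -v -i -E 'cuda|cudnn' | paste -sd: -)
python3 test.py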

Thank you, removing my CUDA-related entries from the LD_LIBRARY_PATH environment variable did the trick. How were you able to identify the conflict?

I’ve seen this kind of issue before, and the missing cuDNN symbol gave the right hint.
