Running Ubuntu 20.04 with Python 3.10.6, I wrote the following test script to reproduce my error:
import torch
import torch.nn as nn
from torch.optim import SGD
import faulthandler
faulthandler.enable()
print(f"torch version: {torch.__version__}")
x = torch.ones([1,3,10,10]).to("cuda:0")
conv1 = nn.Conv2d(3,4,kernel_size=2,bias=False).to("cuda:0")
opt = SGD(conv1.parameters(),lr=1e-3)
y = conv1.forward(x)
loss = y.sum()
opt.zero_grad()
loss.backward()
opt.step()
Running it gives me the following output:
torch version: 2.2.0+cu118
Fatal Python error: Segmentation fault
Current thread 0x00007fea02085740 (most recent call first):
File "/scratch/cluster/dbalaban/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456 in _conv_forward
File "/scratch/cluster/dbalaban/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460 in forward
File "/scratch/cluster/dbalaban/SemanticLabelPropagation/test.py", line 12 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
Segmentation fault (core dumped)
I installed PyTorch with the following command:
python3 -m pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
My nvcc reports the following version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
NVIDIA GPU information (output of nvidia-smi):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:17:00.0 Off |                    0 |
|  0%   42C    P0    82W / 300W |  10643MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:65:00.0 Off |                    0 |
|  0%   43C    P0    85W / 300W |  10641MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:CA:00.0 Off |                    0 |
|  0%   39C    P0    77W / 300W |  10131MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:E3:00.0 Off |                    0 |
|  0%   43C    P0    79W / 300W |  10129MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1451786      C   ray::execute_config             10638MiB |
|    1   N/A  N/A   1457248      C   ray::execute_config             10636MiB |
|    2   N/A  N/A   1439609      C   ray::execute_config             10124MiB |
|    3   N/A  N/A   1441881      C   ray::execute_config             10124MiB |
+-----------------------------------------------------------------------------+
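To make the version comparison explicit, here is a minimal sketch that compares the CUDA tag of the installed wheel (2.2.0+cu118 above) with the CUDA version reported in the nvidia-smi header (12.0 above). The pip wheels ship their own CUDA runtime libraries, so the driver's CUDA version is the one compared here rather than the nvcc version. The function names are hypothetical helpers, not part of torch, and the check is a simplification that ignores CUDA minor-version compatibility.

```python
import re
from typing import Optional, Tuple


def wheel_cuda_version(torch_version: str) -> Optional[Tuple[int, int]]:
    """Parse the '+cuXYZ' local tag of a torch version string.

    '2.2.0+cu118' gives (11, 8); a string without such a tag gives None.
    """
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_cuda_version(smi_header: str) -> Optional[Tuple[int, int]]:
    """Parse 'CUDA Version: X.Y' from the nvidia-smi header line."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports_wheel(torch_version: str, smi_header: str) -> Optional[bool]:
    """Return True if the driver's CUDA version is at least the wheel's.

    Returns None when either version cannot be parsed. This is a
    simplified rule of thumb (assumption): it ignores CUDA
    minor-version compatibility.
    """
    wheel = wheel_cuda_version(torch_version)
    driver = driver_cuda_version(smi_header)
    if wheel is None or driver is None:
        return None
    return driver >= wheel


if __name__ == "__main__":
    # Values copied from the output in this post.
    header = "| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |"
    print(driver_supports_wheel("2.2.0+cu118", header))
```

With the values from this post the check passes for the cu118 wheel, which suggests the segfault is not simply a driver that is too old for that wheel.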
After that segfault, I tried reinstalling with a different CUDA version like so:
python3 -m pip uninstall torch torchvision
python3 -m pip freeze | grep nvidia- | xargs pip uninstall -y
python3 -m pip install torch torchvision
Running the test script again then produced a series of “libcudnn_cnn_train.so.8: undefined symbol” errors during the backward pass. This seems to come from using CUDA 12 instead of CUDA 11.8. I cannot update my system CUDA install because I do not have sudo permissions.
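One possible cause that the output above does not rule out (an assumption, not a confirmed diagnosis) is that a system-wide cuDNN found through LD_LIBRARY_PATH is being loaded instead of the copy that pip installs next to torch. The sketch below only uses the standard library and lists any libcudnn files in the directories on that path; the function name is a hypothetical helper.

```python
import glob
import os
from typing import List


def find_cudnn_libraries(ld_library_path: str) -> List[str]:
    """List libcudnn shared objects in each directory of a path string.

    The argument uses the same format as LD_LIBRARY_PATH: directories
    joined by os.pathsep (':' on Linux). Empty entries are skipped.
    """
    hits: List[str] = []
    for directory in ld_library_path.split(os.pathsep):
        if not directory:
            continue
        pattern = os.path.join(directory, "libcudnn*.so*")
        hits.extend(sorted(glob.glob(pattern)))
    return hits


if __name__ == "__main__":
    # Print every cuDNN library the dynamic loader could pick up
    # from LD_LIBRARY_PATH in the current shell.
    for path in find_cudnn_libraries(os.environ.get("LD_LIBRARY_PATH", "")):
        print(path)
```

If this prints anything, running the test script once with LD_LIBRARY_PATH unset for that command would show whether those system libraries are involved. That needs no sudo.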
What can I do to fix my Python environment so that training works?
Thanks