Segmentation fault (Core dump) when using model.cuda

ellenW · May 22, 2021, 2:37pm

Hi, I’m getting a Segmentation Fault when using model.cuda.

Torch version: 1.2.0, GPU: Quadro RTX 5000, CUDA: 11.2

Here is the output of gdb:

[New Thread 0x7fff63ff5700 (LWP 110466)]
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffef9e3faae in ?? () from /lib64/libcuda.so.1
(gdb)
(gdb) where
#0 0x00007ffef9e3faae in ?? () from /lib64/libcuda.so.1
#1 0x00007ffef9e2b2f9 in ?? () from /lib64/libcuda.so.1
#2 0x00007ffef9c4ab7e in ?? () from /lib64/libcuda.so.1
#3 0x00007ffef9cac3a0 in ?? () from /lib64/libcuda.so.1
#4 0x00007ffef9c66b7f in ?? () from /lib64/libcuda.so.1
#5 0x00007ffef9da56ec in cuDevicePrimaryCtxRetain () from /lib64/libcuda.so.1
#6 0x00007fffe35b8d30 in ?? () from …/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0
#7 0x00007fffe35b9832 in ?? () from …/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0
#8 0x00007fffe35ba2e8 in ?? () from …/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0
#9 0x00007fffe35ad43e in ?? () from …./lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0
#10 0x00007fffe359cde8 in ?? () from …/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0
#11 0x00007fffe35ce23c in cudaMalloc () from …/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0

ptrblck · May 22, 2021, 9:25pm

Could you update to the latest stable or nightly release and check the code again?
If you are still running into the issue, could you post a minimal code snippet to reproduce the issue and the output of python -m torch.utils.collect_env?

ellenW · May 23, 2021, 5:55pm

I created a new virtual environment and installed a newer torch version using the following command:

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch

Unfortunately, I encountered the same error again. The server has 4 GPUs, but I use GPU 0. GPUs 1 and 2 are not suitable for PyTorch.

Here is the output of python -m torch.utils.collect_env:

PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 8.3.2011 (x86_64)
GCC version: (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
Clang version: Could not collect
CMake version: version 3.11.4

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Quadro RTX 5000
GPU 1: Tesla K10.G1.8GB
GPU 2: Tesla K10.G1.8GB
GPU 3: Quadro RTX 5000

Nvidia driver version: 460.32.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchvision==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.2.0 h06a4308_296
[conda] mkl-service 2.3.0 py38h27cfd23_1
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.1 py38ha9443f7_2
[conda] numpy 1.20.2 py38h2d18471_0
[conda] numpy-base 1.20.2 py38hfae3a4d_0
[conda] pytorch 1.8.0 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
[conda] torchaudio 0.8.0 py38 pytorch
[conda] torchvision 0.9.0 py38_cu102 pytorch
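
One way to keep the two Tesla K10 cards out of the picture is to hide them from CUDA before PyTorch initializes it. The following is only a sketch: the GPU list is copied from the collect_env output above, the helper name visible_devices_for is made up for illustration, and it assumes that list is in PCI bus order (which is why CUDA_DEVICE_ORDER is set as well). Both environment variables must be set before the first CUDA call, ideally before importing torch.

```python
import os

# GPU names as listed in the collect_env output above (index = position in the list).
gpu_names = [
    "Quadro RTX 5000",
    "Tesla K10.G1.8GB",
    "Tesla K10.G1.8GB",
    "Quadro RTX 5000",
]


def visible_devices_for(names, wanted):
    """Hypothetical helper: comma-separated indices of GPUs whose name contains `wanted`."""
    return ",".join(str(i) for i, name in enumerate(names) if wanted in name)


# Make CUDA enumerate devices in PCI bus order instead of its default ordering,
# so the indices used below are not silently remapped.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Expose only the Quadro cards; the K10 cards become invisible to the process.
visible = visible_devices_for(gpu_names, "Quadro RTX 5000")
os.environ["CUDA_VISIBLE_DEVICES"] = visible
print(visible)  # 0,3
```

With this in place, the process only sees the two Quadro cards, renumbered as cuda:0 and cuda:1.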

Minimal code snippet:

#pdb.set_trace()
train_size = len(train_dataset)
print(train_size)
train_batch = data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, num_workers=args.num_workers, drop_last=True)

model = convAE(args.c, args.t_length, args.psize, args.fdim[0], args.pdim[0])
#pdb.set_trace()
model.cuda()  # <-- the segmentation fault occurs on this line

params_encoder = list(model.encoder.parameters())
params_decoder = list(model.decoder.parameters())

Thank you…

ellenW · May 23, 2021, 6:18pm

I solved the problem! :)) I added

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
print(f"running with device: {torch.cuda.get_device_name(torch.cuda.current_device())}")

and changed model.cuda() to model.to(device). I think the problem was that the wrong GPU was being selected.
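
For reference, the same pattern can be written as a small self-contained script. This is a minimal sketch, not the original training code: nn.Linear stands in for convAE, and the explicit index in "cuda:0" is an assumption about which GPU is wanted. The device name is only queried when CUDA is in use, so the script also runs on a CPU-only machine.

```python
import torch
import torch.nn as nn

# Stand-in for the convAE model from the post (hypothetical, for illustration only).
model = nn.Linear(4, 2)

# Choose the device explicitly; "cuda:0" pins the first visible GPU, CPU is the fallback.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

# Query the GPU name only when a CUDA device was actually selected.
if device.type == "cuda":
    print(f"running with device: {torch.cuda.get_device_name(device)}")

# Move the model's parameters to the chosen device.
model = model.to(device)
param_device = next(model.parameters()).device
```

Inputs then need to go to the same device, for example batch = batch.to(device), before they are passed to the model.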

iucario · August 29, 2022, 6:21am

I solved my error by using model.to(torch.device('cuda'))