Tensor and model .cuda() hanging indefinitely - Have to crash kernel or force close terminal

3210jr · October 13, 2019, 1:25pm

Hi all,

Whenever I try to move my tensors or model to the GPU using either the .cuda() or .to('cuda') method, the kernel just freezes and has to be terminated to be used again.
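
To reproduce this without losing the whole session, the call can be tried in a separate interpreter with a timeout. This is only a rough sketch: `run_with_timeout`, `CUDA_PROBE`, and the 120-second default are names and values made up for illustration, and it assumes PyTorch is installed in the same environment as the interpreter running it.

```python
import subprocess
import sys

# Snippet executed in the child process (assumes PyTorch is importable there).
CUDA_PROBE = "import torch; torch.zeros(1).cuda(); print('ok')"


def run_with_timeout(snippet, timeout_s=120.0):
    """Run `snippet` in a fresh interpreter and return (finished, output).

    `finished` is False when the child did not exit within `timeout_s` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return False, ""
    return True, proc.stdout + proc.stderr
```

Calling `run_with_timeout(CUDA_PROBE)` from the notebook then reports whether the first CUDA call finished in time, and the kernel itself stays usable either way.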

I’ve looked through several related issues, but they are either extremely old (circa 2017) or were resolved by fixing an incompatible CUDA version, so none of them were very useful here.

Here are my environment details:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 207...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8     5W /  N/A |    303MiB /  7982MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1768      G   /usr/lib/xorg/Xorg                           135MiB |
|    0      2036      G   /usr/bin/gnome-shell                         116MiB |
|    0      2589      G   ...uest-channel-token=11750413998548151078    49MiB |
+-----------------------------------------------------------------------------+

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

I just followed the basic installation instructions on the website:

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

Any ideas? Thanks!

tom · October 13, 2019, 2:41pm

There was an issue where the CUDA 10.1 minor version shipped by Anaconda differed from the one PyTorch was built with, which caused excessive JIT compilation.
It has been fixed, so reinstalling PyTorch should help.
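
To see what such a difference looks like, here is a small sketch that parses the full version out of `nvcc --version` text and names the first component where two versions differ. The helper names are made up for illustration and the second version in the usage note is a placeholder. Note that `nvcc` reports the system toolkit, which is not necessarily the runtime the conda package uses, so treat the result as a hint only.

```python
import re


def nvcc_version(nvcc_output):
    """Return (major, minor, patch) parsed from `nvcc --version` text, or None."""
    match = re.search(r"V(\d+)\.(\d+)\.(\d+)", nvcc_output)
    return tuple(int(group) for group in match.groups()) if match else None


def parse_version(text):
    """Turn a dotted version string such as '10.1.105' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def first_difference(a, b):
    """Name the first component where two version tuples differ, else None."""
    for name, left, right in zip(("major", "minor", "patch"), a, b):
        if left != right:
            return name
    return None
```

For the output posted above, `nvcc_version` gives `(10, 1, 105)`. Comparing that with the version shown by `conda list cudatoolkit` (for example via `first_difference(nvcc_version(text), parse_version("10.1.243"))`, where `10.1.243` is only an illustrative value) would report `"patch"` when the two agree on 10.1 but differ below that.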

Best regards

Thomas

3210jr · October 13, 2019, 3:24pm

Worked like a charm!

Thanks @tom