Training crashes with no error when using CUDA

I’m running into a very strange issue when training my model in PyTorch. On CPU it works flawlessly, but when I switch over to CUDA, the program crashes without reporting any error. I’m running it in this Google Colab Notebook, and at a specific point the notebook just crashes “unexpectedly” with no error reported whatsoever.

I realize that anybody clicking that link may be deterred from helping me, but I have pretty much no clue what the issue might be.

  • Do you know of any common issues that cause a crash with no error?

I have managed to narrow it down to one specific point. The second-to-last cell contains my main training loop. To better capture print messages, I put a sleep at the end of each loop iteration (roughly sketched below). The program crashes after the sleep (it never wakes up) exactly when my lengths.max() == 2, and I have no idea why.

  • Does time.sleep behave weirdly when using CUDA?
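
For context, here is roughly how the sleep is placed at the end of each iteration (a heavily simplified sketch; the dummy lengths tensors stand in for my real batches, and the actual forward/backward pass is omitted):

import time
import torch

for step, lengths in enumerate([torch.tensor([1, 3]), torch.tensor([2, 2])]):
    # ... forward/backward pass on the batch happens here in my real code ...
    print(f"step {step}: lengths.max() = {lengths.max().item()}", flush=True)
    time.sleep(1)  # give the notebook a chance to flush the prints before a potential crash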

Thank you for your help.

Could you post a small executable code snippet?
I guess the real error message is hidden by the Jupyter kernel crash/restart.
Running the code in a terminal should give you an error message.

Hi,

Unfortunately, I have no local GPU, so I am unable to run it in a terminal.

However, I did manage to pinpoint the issue and solve it. Apparently, passing a batch of size 0 through a GRUCell doesn’t work on CUDA, even though it does on CPU…? Either way, adding a check in my code for the batch size of the input to the GRUCell fixed the issue.
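
For reference, the workaround boils down to a guard along these lines (the safe_grucell helper and passing the hidden state through unchanged are just my way of illustrating it; the structure in my actual code differs):

import torch
from torch import nn

def safe_grucell(cell: nn.GRUCell, input_data: torch.Tensor, hidden_state: torch.Tensor) -> torch.Tensor:
    # Skip the GRUCell call entirely for an empty batch; a zero-sized batch
    # crashes on CUDA (but works on CPU), so just pass the hidden state through.
    if input_data.size(0) == 0:
        return hidden_state
    return cell(input_data, hidden_state)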

Could you please post a small code snippet demonstrating this behavior?
It should either work on both devices or raise an error on both.

Sure, here you go. The following crashes when device is set to “cuda”, but runs fine with device set to “cpu”.

import torch
from torch import nn

# Crashes when set to "cuda" (on PyTorch 1.4.0), but runs fine when set to "cpu".
device = "cpu"

grucell = nn.GRUCell(32, 32).to(device)

# Note the batch dimension of 0, i.e. an empty batch.
hidden_state = torch.rand(0, 32).to(device)
input_data = torch.rand(0, 32).to(device)

output_state = grucell(input_data, hidden_state)

print(output_state)

The code works using the nightly binary from ~1 week ago, so this issue might have already been fixed.
Could you install the nightly binary (in a new virtual environment) and rerun the code, please?

Sorry, but I am unable to, as I don’t own a CUDA device myself; I can only run it through Google Colab.

You should still be able to install it in the notebook via !pip install ...

Sorry for the late reply. Yes, with the nightly build it works; with 1.4.0, it does not.