Training crashes with no error when using CUDA

I’m running into a very strange issue when training my model in PyTorch. On CPU it works flawlessly, but when I switch over to CUDA, the program crashes without reporting any error. I’m running it in this Google Colab Notebook, and at a specific point the notebook just crashes “unexpectedly” with no error reported whatsoever.

I realize that anybody clicking that link may be deterred from helping me, but I have pretty much no clue what the issue might be.

  • Do you know of any common issues that cause a crash with no error?

I have managed to narrow it down to one specific point. The second-to-last cell contains my main training loop. To better capture print messages, I put a sleep at the end of each loop iteration (roughly sketched below). The program crashes after the sleep (it never wakes up) exactly when my lengths.max() == 2, and I have no idea why.

  • Does time.sleep behave weirdly when using CUDA?
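
For context, here is roughly how the sleep is placed at the end of each iteration (a heavily simplified sketch; the dummy lengths tensors stand in for my real batches, and the actual forward/backward pass is omitted):

import time
import torch

for step, lengths in enumerate([torch.tensor([1, 3]), torch.tensor([2, 2])]):
    # ... forward/backward pass on the batch happens here in my real code ...
    print(f"step {step}: lengths.max() = {lengths.max().item()}", flush=True)
    time.sleep(1)  # give the notebook a chance to flush the prints before a potential crash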

Thank you for your help.

Could you post a small executable code snippet?
I guess the real error message is hidden by the Jupyter kernel crash/restart.
Running the code in a terminal should give you an error message.

Hi,

Unfortunately, I have no local GPU, so I am unable to run it in a terminal.

However, I did manage to pinpoint the issue and solve it. Apparently, passing a batch of size 0 through a GRUCell doesn’t work on CUDA, even though it does on CPU…? Either way, adding a check in my code for the batch size of the input to the GRUCell fixed the issue.
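
For reference, the workaround boils down to a guard along these lines (the safe_grucell helper and passing the hidden state through unchanged are just my way of illustrating it; the structure in my actual code differs):

import torch
from torch import nn

def safe_grucell(cell: nn.GRUCell, input_data: torch.Tensor, hidden_state: torch.Tensor) -> torch.Tensor:
    # Skip the GRUCell call entirely for an empty batch; a zero-sized batch
    # crashes on CUDA (but works on CPU), so just pass the hidden state through.
    if input_data.size(0) == 0:
        return hidden_state
    return cell(input_data, hidden_state)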

Could you please post a small code snippet demonstrating this behavior?
It should either work on both devices or raise an error on both.

Sure, here you go. The following crashes when device is set to “cuda”, but runs fine with device set to “cpu”.

import torch
from torch import nn

# Crashes when set to "cuda" (on PyTorch 1.4.0), but runs fine when set to "cpu".
device = "cpu"

grucell = nn.GRUCell(32, 32).to(device)

# Note the batch dimension of 0, i.e. an empty batch.
hidden_state = torch.rand(0, 32).to(device)
input_data = torch.rand(0, 32).to(device)

output_state = grucell(input_data, hidden_state)

print(output_state)

The code works using the nightly binary from ~1 week ago, so this issue might have already been fixed.
Could you install the nightly binary (in a new virtual environment) and rerun the code, please?

Sorry, but I am unable to, as I don’t own a CUDA device myself; I can only run it through Google Colab.

You should still be able to install it in the notebook via !pip install ...

Sorry for the late reply. Yes, with the nightly build it works; with 1.4.0, it does not.