I’m having a very weird issue when training my model in PyTorch. On CPU it works flawlessly, but when I switch over to CUDA, the program crashes with no error reported. I’m running it in this Google Colab notebook, and at a specific point the notebook just crashes “unexpectedly” with no error message whatsoever.
I realize that anybody clicking that link will be deterred from helping me. However, I have pretty much no clue what the issue might be.
Do you have any idea of common issues that cause a crash with no error?
I have managed to narrow it down to one specific point in time. In the second-to-last cell, I have my main training loop. In order to better capture print messages (?) I put a sleep at the end of each loop iteration. The program crashes after the sleep (it never wakes up) when my lengths.max() == 2, and I have no idea why.
Does the sleep behave weirdly when using CUDA (time.sleep)?
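For context, the end of each iteration looks roughly like this (heavily simplified; the loop body and step count are placeholders, not my actual training code). In hindsight, passing flush=True to print might have been enough to capture the messages without the sleep:

```python
import time

for step in range(3):
    # ... training step would go here (placeholder) ...
    # Flushing makes the message appear in the notebook immediately,
    # which is what the sleep was meant to help with.
    print(f"step {step} done", flush=True)
    time.sleep(0.1)  # the crash happens after this call when lengths.max() == 2
```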
Could you post a small executable code snippet?
I guess the real error message is hidden by the Jupyter kernel crash/restart.
Running the code in a terminal should give you an error message.
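To see why the terminal helps: a kernel restart swallows stderr, while a plain process lets you read it directly. Here is a small self-contained demo of that idea (the inline raise is just a stand-in for your training script):

```python
import subprocess
import sys

# Launch a script as a separate process, the way a terminal run would,
# and capture its stderr. A crashing Jupyter kernel hides this output;
# a terminal run does not. The inline `raise` stands in for the real
# training script.
result = subprocess.run(
    [sys.executable, "-c",
     "raise RuntimeError('the real CUDA error would appear here')"],
    capture_output=True,
    text=True,
)
print(result.stderr)
```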
Unfortunately I have no GPU, and so I am unable to do so.
However, I did manage to pinpoint the issue and solve it. Apparently, passing a batch of size 0 through a GRUCell doesn’t work on CUDA, but it does on CPU…? Either way, adding a flag in my code to check the batch size of the input to the GRUCell fixed the issue.
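Roughly, the fix looks like this (the sizes are made up for illustration; my real model wraps the GRUCell differently):

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=8, hidden_size=16)  # illustrative sizes

x = torch.randn(0, 8)   # an empty batch, which my pipeline can produce
h = torch.zeros(0, 16)

# Guard against the empty batch instead of feeding it to the cell,
# which crashed on CUDA (but not on CPU).
if x.size(0) > 0:
    h = cell(x, h)
# h is left untouched (still empty) when the batch is empty
```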
The code works using the nightly binary from ~1 week ago, so this issue might already have been fixed.
Could you install the nightly binary (in a new virtual environment) and rerun the code, please?