Unable to use CUDA after it fails due to an OOM error

Cow_woC · September 20, 2021, 2:39am

Hi,

I am trying to train a model on the GPU. If an operation fails with an out-of-memory error, I want to reduce the batch size and try again. I posted a test case at Anomaly detection: Error detected in CudnnRnnBackward0 · Issue #65301 · pytorch/pytorch · GitHub showing that retrying the operation always fails with "CUDA error: an illegal memory access was encountered".
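The retry pattern described above can be sketched roughly as follows. This is an illustration only, not the test case from the issue: `is_oom`, `run_with_batch_backoff`, `train_step`, and `on_oom` are hypothetical names, and the sketch assumes that a CUDA OOM surfaces as a `RuntimeError` whose message contains "out of memory". It does not address the illegal memory access itself.

```python
import gc


def is_oom(exc):
    # Assumption: CUDA OOM is reported as a RuntimeError (or a subclass)
    # whose message contains "out of memory".
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def run_with_batch_backoff(train_step, batch_size, min_batch_size=1, on_oom=None):
    """Call train_step(batch_size), halving batch_size after each OOM.

    train_step and on_oom are hypothetical callables supplied by the caller.
    Returns (batch_size_that_worked, result_of_train_step).
    """
    while batch_size >= min_batch_size:
        try:
            return batch_size, train_step(batch_size)
        except RuntimeError as exc:
            if not is_oom(exc):
                raise  # unrelated errors are not retried
        # Reached only after an OOM. Cleanup happens outside the except
        # block so the traceback no longer references the failed step.
        gc.collect()
        if on_oom is not None:
            on_oom()  # e.g. torch.cuda.empty_cache
        batch_size //= 2
    raise RuntimeError("out of memory even at the minimum batch size")
```

With PyTorch installed, a call might look like `run_with_batch_backoff(step, 64, on_oom=torch.cuda.empty_cache)`, where `step` is your own function that runs one training iteration at the given batch size.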

I suspect there is a bug in the PyTorch C++ implementation, but I can't be certain without further investigation by a PyTorch committer.

Is anyone else able to reproduce this error using the test case?

Am I supposed to somehow reset the CUDA state after an out of memory error?

Is this code doing anything wrong?

Thank you,
Gili

arya47 · September 20, 2021, 3:57am

Not sure about the illegal memory access, but check how much of your dataset you are loading and how much memory your model requires.