Unable to use CUDA after it fails due to an OOM error


I am attempting to train a model on the GPU. If the operation fails with an out-of-memory error, I want to reduce the batch size and try again. I posted a testcase in pytorch/pytorch issue #65301 ("Anomaly detection: Error detected in CudnnRnnBackward0") showing that retrying the operation always fails with "CUDA error: an illegal memory access was encountered".

I suspect there is a bug in the PyTorch C++ implementation, but I can’t be certain without further investigation by a PyTorch committer.

Is anyone else able to reproduce this error using the testcase?

Am I supposed to somehow reset the CUDA state after an out of memory error?

Is this code doing anything wrong?

Thank you,

Not sure about the illegal access, but check how much of your dataset you are loading and how much memory your model requires.
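For the retry-with-a-smaller-batch idea itself, here is a minimal sketch of one common pattern. It is not a fix for the illegal memory access in the issue; it only shows how an OOM retry loop is usually structured. The names `run_with_backoff`, `is_oom`, and `train_step` are hypothetical (not PyTorch APIs), and I am assuming the OOM surfaces as a `RuntimeError` whose message contains "out of memory". The recovery runs outside the `except` block so that the exception, and the tensors its traceback keeps alive, are released before retrying.

```python
import gc

try:
    import torch
except ImportError:  # assumption: lets the sketch run without PyTorch installed
    torch = None


def is_oom(err):
    """Heuristic: treat a RuntimeError mentioning 'out of memory' as a CUDA OOM."""
    return isinstance(err, RuntimeError) and "out of memory" in str(err).lower()


def run_with_backoff(train_step, batch_size, min_batch_size=1):
    """Call train_step(batch_size), halving batch_size after each OOM.

    train_step is a hypothetical callable supplied by the caller.
    Returns (batch_size_that_worked, result_of_train_step).
    """
    while True:
        try:
            return batch_size, train_step(batch_size)
        except RuntimeError as err:
            if not is_oom(err):
                # Anything else (e.g. an illegal memory access) is re-raised;
                # retrying in the same process is unlikely to help.
                raise
        # Reached only after an OOM. We are outside the except block here,
        # so the exception and its traceback no longer pin any tensors.
        if batch_size <= min_batch_size:
            raise RuntimeError("out of memory even at the minimum batch size")
        batch_size = max(min_batch_size, batch_size // 2)
        gc.collect()
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver
```

You would call it as something like `batch_size, loss = run_with_backoff(my_step, 64)`, where `my_step` builds the batch and runs forward/backward. If the retry still dies with an illegal memory access, as in the testcase, that points to something other than the retry logic.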