Hi,
I am trying to train a model on the GPU, and if the operation fails with an out-of-memory error, I want to reduce the batch size and try again. I posted a test case in "Anomaly detection: Error detected in CudnnRnnBackward0" (pytorch/pytorch issue #65301 on GitHub) showing that retrying the operation always fails with "CUDA error: an illegal memory access was encountered".
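To make the retry pattern concrete, here is a minimal sketch of the kind of loop I mean. It is not the code from the issue. The names run_with_batch_backoff, is_oom_error, train_step and cleanup are hypothetical, and the sketch assumes the out-of-memory condition surfaces as a RuntimeError whose message contains "out of memory".

```python
from typing import Any, Callable, Optional, Tuple


def is_oom_error(exc: BaseException) -> bool:
    # Assumption: an out-of-memory failure is a RuntimeError whose
    # message contains the phrase "out of memory".
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def run_with_batch_backoff(
    train_step: Callable[[int], Any],
    batch_size: int,
    min_batch_size: int = 1,
    cleanup: Optional[Callable[[], None]] = None,
) -> Tuple[int, Any]:
    """Call train_step(batch_size), halving the batch size after each OOM.

    Returns (batch_size_that_worked, result). Errors that are not
    out-of-memory errors are re-raised unchanged.
    """
    while True:
        try:
            return batch_size, train_step(batch_size)
        except RuntimeError as exc:
            if not is_oom_error(exc):
                raise
        # Recovery happens outside the except block so the exception
        # (and the stack frames it references) is released first.
        if cleanup is not None:
            cleanup()
        if batch_size <= min_batch_size:
            raise RuntimeError(
                f"out of memory even at the minimum batch size {min_batch_size}"
            )
        batch_size = max(min_batch_size, batch_size // 2)
```

In a PyTorch setting, train_step would run one forward/backward pass at the given batch size, and torch.cuda.empty_cache could be passed as cleanup. Whether that is enough to avoid the illegal memory access on the retry is exactly what I am unsure about.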
I suspect there is a bug in the PyTorch C++ implementation, but I can’t be certain without further investigation by a PyTorch committer.
Is anyone else able to reproduce this error using the test case?
Am I supposed to somehow reset the CUDA state after an out of memory error?
Is this code doing anything wrong?
Thank you,
Gili