Help with recovery from out of memory

Hi all,

I’m having trouble recovering from out-of-memory conditions. My minibatch routine looks like this:

  try:
    nll = computeminibatch(...)
    return nll
  except RuntimeError as e:
    # out-of-memory recovery, which calls zero_grad()
    ...

This recovery routine works about half the time. The other half of the time, I crash with an out of memory exception thrown within zero_grad().

Is anyone else seeing this behavior? Is there a way to clean up and continue without triggering the second OOM condition?

This is PyTorch 1.0.1 on Python 2.7.15, x86_64, CUDA 9.0, cuDNN 7.0.4.


Following up on my own post: my problem seems to go away if I loop over the params, simply set p.grad = None on each one, and empty the cache afterwards.
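In case it helps anyone, here is a rough sketch of that workaround. The names clear_grads_and_cache, safe_minibatch and compute_minibatch are placeholders made up for illustration, and matching on the "out of memory" text of the RuntimeError is an assumption about how the error is reported, so treat this as a starting point rather than a recommended recipe.

```python
import torch


def clear_grads_and_cache(model):
    """Drop every parameter's gradient tensor, then release cached CUDA memory."""
    for p in model.parameters():
        # Drop the reference to the gradient instead of zeroing it in place.
        p.grad = None
    if torch.cuda.is_available():
        # Release unused cached blocks held by PyTorch's CUDA allocator.
        torch.cuda.empty_cache()


def safe_minibatch(model, compute_minibatch, batch):
    """Return the minibatch loss, or None if the batch ran out of memory.

    compute_minibatch is a hypothetical callable taking (model, batch).
    """
    try:
        return compute_minibatch(model, batch)
    except RuntimeError as e:
        # Assumption: OOM errors carry this text; anything else is re-raised.
        if "out of memory" not in str(e):
            raise
        clear_grads_and_cache(model)
        return None
```

The caller can then skip any batch for which safe_minibatch returns None and carry on with the next one.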

Do any PyTorch devs have an opinion on the right way to do this? @smth @apaszke ?

I’d expect that you are probably at the peak of your GPU usage, so it’s a bit random that optimizer.zero_grad() is the call that runs OOM.
I say that because optimizer.zero_grad() doesn’t do any new allocations, so it’s surprising that it runs OOM there…

What might actually be happening is that the previous CUDA-level error didn’t get cleared yet, and CUDA is rethrowing the error on a subsequent CUDA call.

I did try to clear the cache before the zero_grad as well as after, but that didn’t help. For some reason, setting p.grad to None did work.