I’d expect that you are probably at the peak of your GPU usage, so it’s a bit random that optimizer.zero_grad() is the call that runs OOM.
I say that because optimizer.zero_grad() doesn’t do any new allocations, so it’s surprising that it runs OOM over there…
What might actually be happening is that a previous CUDA-level error hasn’t been cleared yet, and CUDA is rethrowing that error on a subsequent CUDA call, which just happens to be optimizer.zero_grad().
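Since CUDA launches kernels asynchronously, the reported stack trace can point at a later op than the one that actually failed. One common way to make the error surface at the real failing op (assuming your training script is called `train.py` here, just as a placeholder) is to force synchronous launches while debugging:

```shell
# Force every CUDA kernel launch to synchronize, so errors are raised
# at the op that actually failed rather than at a later CUDA call.
# This slows things down considerably, so only use it for debugging.
CUDA_LAUNCH_BLOCKING=1 python train.py
```

With blocking launches enabled, the stack trace should point at the allocation or kernel that really triggered the OOM, which makes it easier to tell whether zero_grad() is the culprit or just the messenger.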