Hi all,
We have a model that involves an autograd call as part of its forward pass, something along the lines of:
```python
return torch.autograd.grad(
    [prev.sum()],
    wrt_tensor,
    create_graph=True,  # needed to allow gradients of this output
)[0]
```
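For context, here is a minimal, self-contained version of the pattern. The toy function standing in for `prev` is hypothetical; in our model it is the output of a much larger network over `wrt_tensor`:

```python
import torch

def forward_with_grad(wrt_tensor):
    # Hypothetical stand-in for our model's intermediate output `prev`.
    prev = (wrt_tensor * wrt_tensor).sin()
    # The forward pass returns the gradient of prev.sum() w.r.t. the input.
    # create_graph=True keeps the double-backward graph alive so that this
    # returned gradient is itself differentiable.
    return torch.autograd.grad(
        [prev.sum()],
        wrt_tensor,
        create_graph=True,
    )[0]

x = torch.randn(4, requires_grad=True)
g = forward_with_grad(x)
# g carries a grad_fn, so later losses can backpropagate through it.
```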
When we scale the hidden dimension of our model we have started to get CUDA out-of-memory errors with enormous allocations at this call:
RuntimeError: CUDA out of memory. Tried to allocate 20.54 GiB (GPU 0; 11.17 GiB total capacity; 831.28 MiB already allocated;
9.91 GiB free; 886.00 MiB reserved in total by PyTorch)
This happens the very first time the call is made, which does not make much sense: if the whole model, data, compute graph, etc. fit in under 900 MiB, why does autograd need to allocate over 20 GiB?
Is there a way to determine which part of the compute graph is causing autograd to make this allocation? The memory profiler unhelpfully points at `aten::empty` as the culprit, and the Python debugger cannot step into the C++ autograd engine itself.
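In case it is useful context: one thing we can do is walk the backward graph by hand via `grad_fn.next_functions`, which at least names the backward nodes, though as far as we can tell it does not reveal how much memory each node will allocate. A sketch (`graph_nodes` is our own helper, not a PyTorch API):

```python
import torch

def graph_nodes(fn, seen=None):
    """Collect the class names of backward-graph nodes reachable from fn."""
    if seen is None:
        seen = set()
    if fn is None or fn in seen:
        return []
    seen.add(fn)
    names = [type(fn).__name__]
    # next_functions holds (node, input_index) pairs for upstream edges.
    for next_fn, _ in fn.next_functions:
        names += graph_nodes(next_fn, seen)
    return names

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()
print(graph_nodes(y.grad_fn))  # e.g. SumBackward0, MulBackward0, AccumulateGrad
```

This tells us which ops are in the graph, but not which one triggers the 20 GiB request.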
Thank you for your help!