We have a model that involves an autograd call as part of its forward pass, something along the lines of:
```python
return torch.autograd.grad(
    [prev.sum()],
    wrt_tensor,
    create_graph=True,  # needed to allow gradients of this output
)
```
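For concreteness, here is a minimal self-contained sketch of the pattern we mean (the module, layer sizes, and tensor names are made up for illustration; our real model is much larger):

```python
import torch


class GradAsOutput(torch.nn.Module):
    """Toy stand-in: the forward pass returns a gradient computed by autograd."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, 1),
        )

    def forward(self, wrt_tensor: torch.Tensor) -> torch.Tensor:
        prev = self.net(wrt_tensor)
        # create_graph=True so the returned gradient can itself be backpropagated through
        (grad,) = torch.autograd.grad(
            [prev.sum()], [wrt_tensor], create_graph=True
        )
        return grad


# The input must require grad, since we differentiate with respect to it.
x = torch.randn(8, 512, requires_grad=True)
out = GradAsOutput(hidden_dim=512)(x)
out.sum().backward()  # double backward works because of create_graph=True
```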
When we scale up the hidden dimension of our model, we start getting CUDA out-of-memory errors with enormous allocations at this call:
```
RuntimeError: CUDA out of memory. Tried to allocate 20.54 GiB (GPU 0; 11.17 GiB total capacity; 831.28 MiB already allocated; 9.91 GiB free; 886.00 MiB reserved in total by PyTorch)
```
This happens the very first time the call is made, which does not make much sense: if the whole model, data, compute graph, etc. fit in 886 MiB, why does autograd need to allocate 20 GiB?
Is there a way to determine which part of the compute graph is causing autograd to make this allocation? The memory profiler unhelpfully points to `aten::empty` as the culprit, and the Python debugger cannot step into autograd itself.
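For reference, this is roughly how we invoked the memory profiler (a sketch only, shown here with the toy module above standing in for our real model and input, at a size that still fits in memory):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Using the toy module above as a stand-in for our real model.
model = GradAsOutput(hidden_dim=512).cuda()
x = torch.randn(8, 512, device="cuda", requires_grad=True)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
) as prof:
    out = model(x)

# Sorting by CUDA memory usage only surfaces aten::empty, not a specific graph node.
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))
```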
Thank you for your help!