Debugging memory allocations in `torch.autograd.grad`

Hi all,

We have a model that involves an autograd call as part of its forward pass, something along the lines of:

return torch.autograd.grad(
    [prev.sum()],
    wrt_tensor,
    create_graph=True,  # needed to allow gradients of this output
)[0]
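
For context, here is a minimal self-contained version of that pattern; the module name, layer, and shapes below are hypothetical stand-ins, not our real model:

import torch
import torch.nn as nn

class GradInForward(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, wrt_tensor):
        # wrt_tensor must require grad so autograd.grad can differentiate w.r.t. it
        prev = self.net(wrt_tensor)
        return torch.autograd.grad(
            [prev.sum()],
            wrt_tensor,
            create_graph=True,  # keep the graph so the returned gradient is itself differentiable
        )[0]

model = GradInForward(16)
x = torch.randn(4, 16, requires_grad=True)
out = model(x)          # forward pass that internally calls autograd.grad
out.sum().backward()    # second-order backward through that grad call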

When we scale up the hidden dimension of our model, we start getting CUDA out-of-memory errors with enormous allocations at this call:

RuntimeError: CUDA out of memory. Tried to allocate 20.54 GiB (GPU 0; 11.17 GiB total capacity; 831.28 MiB already allocated; 9.91 GiB free; 886.00 MiB reserved in total by PyTorch)

This happens the very first time this call is made, which does not make much sense: if the whole model, data, compute graph, etc. fit in under 886 MiB, why does autograd need to allocate over 20 GiB here?

Is there a way to determine which part of the compute graph is causing autograd to make this allocation? The memory profiler unhelpfully points at aten::empty, and the Python debugger can't step into the autograd engine itself (it runs in C++).
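
For reference, the profiling run looked roughly like this (assuming torch.profiler is what counts as "the memory profiler" here; model and x stand in for our actual setup):

import torch
from torch.profiler import profile, ProfilerActivity

# Profile one forward pass and sort operators by how much CUDA memory they allocate themselves
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
) as prof:
    out = model(x)

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))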

Thank you for your help!

Hi,

You can enable anomaly mode. That will show you the forward op that corresponds to the one that is failing in the backward.
Can you share this trace?
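
A minimal sketch of turning it on, using a tiny stand-in for the forward pass (prev and wrt_tensor as in the first post):

import torch

# Globally enable anomaly detection; when a backward node fails, autograd
# also prints the traceback of the forward op that created it.
torch.autograd.set_detect_anomaly(True)

# Tiny stand-in for the real computation:
wrt_tensor = torch.randn(8, requires_grad=True)
prev = wrt_tensor ** 2

# Or scope detection to just the region that fails:
with torch.autograd.detect_anomaly():
    grad = torch.autograd.grad([prev.sum()], wrt_tensor, create_graph=True)[0]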

Hi @albanD,

That ended up letting me find the issue. Initially it hadn't seemed useful, because the offending code was automatically generated and so wasn't available for the traceback; temporarily changing that made it possible to see which line was responsible and to resolve the issue.

(For us, it was an issue with how we had expressed a broadcasting operation.)
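
For anyone hitting the same thing, a hypothetical illustration of the kind of broadcasting mistake that can do this (not our actual code, and the sizes here are kept small):

import torch

# Elementwise ops on (N, 1) and (1, M) tensors silently broadcast to an
# (N, M) intermediate; the backward pass materializes a buffer of the same
# size, which grows quadratically as the hidden dimension is scaled up.
N = M = 4096
a = torch.randn(N, 1, requires_grad=True)
b = torch.randn(1, M)

out = (a * b).sum()  # materializes an N x M intermediate
grad = torch.autograd.grad(out, a, create_graph=True)[0]  # backward allocates another N x M buffer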

Thanks!
