How to predict the memory required to store a graph for higher order gradients

Is it possible to predict the memory required to store the graph needed if I call torch.autograd.grad with create_graph=True?

Here is a concrete example. Suppose I have a simple neural net with layer sizes 100 × 10 × 1, so my feature dimension is 100 and the output is a scalar. I want to add the gradients of the output with respect to my inputs to my loss. Suppose further that I have 50 examples, so my input, call it x, has shape (50, 100). I then compute:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(torch.nn.Linear(100, 10), torch.nn.ReLU(), torch.nn.Linear(10, 1)).to(device)
x = torch.rand(50, 100, device=device, requires_grad=True)
predictions = model(x)
gradients, = torch.autograd.grad(predictions.sum(), x, create_graph=True, retain_graph=True)

How much memory does each call need?
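For the forward graph at least, a back-of-the-envelope estimate is possible by summing the sizes of the tensors each operation saves for its backward. The sketch below is an assumption-laden illustration for the toy 100 × 10 × 1 net above; `estimate_saved_activation_bytes` is a hypothetical helper, not a PyTorch API, and it assumes float32 storage, that each Linear saves its input for the weight gradient, and that ReLU saves its output (which is the same storage the second Linear saves as its input, so it is counted once).

```python
# Hypothetical helper: rough estimate of the activation memory (bytes)
# autograd keeps alive for Linear(100, 10) -> ReLU -> Linear(10, 1)
# with batch size 50. Parameters (weights) are excluded: they exist anyway.
def estimate_saved_activation_bytes(batch=50, dims=(100, 10, 1), bytes_per_el=4):
    saved = batch * dims[0]   # x, saved by the first Linear for its weight grad
    saved += batch * dims[1]  # hidden activation: saved by ReLU, and the same
                              # storage is what the second Linear saves as input
    return saved * bytes_per_el

print(estimate_saved_activation_bytes())  # 22000 bytes for this toy net
```

This only covers the first (forward) graph; the extra graph recorded when create_graph=True is the harder part, since it depends on what each operator's backward needs to save.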

It would depend on the operators. For example, if the function in my forward pass is squaring, the backward pass computes 2·x, which is linear in x, so the graph recorded over the backward pass would not need to save any extra tensors!
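To make the squaring example concrete, here is a small sketch: the first torch.autograd.grad call returns 2·x, and because that expression is linear in x, differentiating it again yields the constant 2 without needing any saved forward activations.

```python
import torch

# y = x**2: the backward pass computes 2*x. Differentiating 2*x again
# gives the constant 2, so the create_graph=True graph over the backward
# pass is cheap -- no extra tensors need to be saved for it.
x = torch.tensor([3.0], requires_grad=True)
y = (x ** 2).sum()
g, = torch.autograd.grad(y, x, create_graph=True)  # g = 2*x = 6
g2, = torch.autograd.grad(g.sum(), x)              # second derivative = 2
print(g.item(), g2.item())  # 6.0 2.0
```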

See the concrete example given above. I suspect the general case involves non-linearities.
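As a hedged illustration of the non-linear case: for a function like sin, the backward pass computes cos(x), which still depends on x, so the graph built with create_graph=True has to keep a saved tensor alive, costing memory proportional to the size of x (unlike the squaring case, where the second derivative is a constant).

```python
import math
import torch

# For a nonlinearity such as sin, the backward pass computes cos(x),
# which depends on x; the graph recorded with create_graph=True must
# therefore retain the tensor cos needs in order to differentiate again.
x = torch.tensor([0.5], requires_grad=True)
g, = torch.autograd.grad(torch.sin(x).sum(), x, create_graph=True)  # cos(x)
g2, = torch.autograd.grad(g.sum(), x)                               # -sin(x)
print(g.item(), g2.item())
```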