My GPU memory keeps on increasing after every iteration. I have 2 losses (h_loss, f_loss) and etas is a list of trainable parameters defined outside the loop. Here is the code snippet inside the training loop.

# define etas (trainable Variables)
while iters < opts['max_iters']:
    data = torch.from_numpy(data)
    input_var = Variable(data)
    optimizer.zero_grad()
    # get h_loss and f_loss
    h_grads = torch.autograd.grad(outputs=h_loss, inputs=cv_params, create_graph=True)
    torch.autograd.backward(cv_params, h_grads)
    for eta_i, param in enumerate(cv_params):
        param.grad = param.grad * etas[eta_i]
    f_grads = torch.autograd.grad(outputs=f_loss, inputs=cv_params, create_graph=True)
    torch.autograd.backward(cv_params, f_grads)
    flat_params = []
    for param in cv_params:
        flat_params.append(param.grad.view(-1))
    flat_params = torch.cat(flat_params, 0)
    var_loss = (flat_params**2).mean()
    var_grads = torch.autograd.grad(outputs=var_loss, inputs=var_params, create_graph=True)
    torch.autograd.backward(var_params, var_grads)
    optimizer.step()
    # get next data batch
    iters += 1

param.grad is a Variable, so unless you call zero_grad() during each iteration, the gradients may be keeping their computation graphs alive. If so, you can get around that by detaching them at the end of the loop.

If you keep track of the loss from one iteration to the next, then unless you do so using loss.data[0] it might be storing its computation graph with it.
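
To illustrate the point about tracking the loss, here is a minimal sketch. It assumes PyTorch 0.4 or later, where loss.item() replaces loss.data[0]; the names w, running_with_graph and running_plain are made up for the example and are not from the original code.

```python
import torch

w = torch.ones(3, requires_grad=True)  # stand-in for a model parameter

running_with_graph = 0.0  # accumulates tensors, so each loss graph stays referenced
running_plain = 0.0       # accumulates Python floats only

for _ in range(3):
    loss = (w ** 2).sum()
    running_with_graph = running_with_graph + loss  # result is a tensor with a grad_fn
    running_plain += loss.item()                    # plain float, no graph attached

print(type(running_plain))
print(running_with_graph.grad_fn is not None)
```

Only the float version lets each iteration's graph be freed once loss goes out of scope.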

Yes, I do call zero_grad() at the start of each iteration, and I am not accumulating losses from previous iterations. The whole training loop is described above.

h_grads = torch.autograd.grad(outputs=h_loss, inputs=cv_params, create_graph=True)

computes the gradients of h_loss w.r.t. cv_params and puts the gradients in h_grads. From my tests neither h_grads nor cv_params will have a grad_fn after this step, so I don’t see what graph gets created, nor where it is stored.

torch.autograd.backward(cv_params, h_grads)

Calculates the gradients of cv_params using h_grads as the initial gradients. I don’t see how this works. cv_params are not the results of computations, they are parameters of the model, so there is nothing to calculate. Oh I see, this simply adds h_grad to cv_param.grad.
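
A small experiment can make these two steps concrete. The sketch below assumes PyTorch 0.4 or later and uses a made-up parameter w with a cubic loss, not the original model. The loss is deliberately nonlinear: a loss that is linear in the parameters gives a gradient that does not depend on them, which may hide the graph.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # hypothetical parameter
loss = (w ** 3).sum()

# Step 1: gradient of the loss w.r.t. w, recorded as part of a graph.
(g,) = torch.autograd.grad(outputs=loss, inputs=[w], create_graph=True)
has_graph = g.grad_fn is not None  # g = 3 * w**2 is itself a function of w
print(has_graph)

# Step 2: calling backward on the leaf w with g as the incoming gradient
# accumulates g into w.grad.
torch.autograd.backward([w], [g])
print(w.grad)
```

In this toy case the gradient returned with create_graph=True does carry a grad_fn, so anything that keeps a reference to it also keeps the graph behind it.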

Skipping a bit…

flat_params = []
for param in cv_params:
    flat_params.append(param.grad.view(-1))
flat_params = torch.cat(flat_params, 0)

This copies all cv_param.grad into a flat Variable. At this point, flat_params depends on param.grad for param in cv_params. So var_loss is simply the mean of the squares of param.grad, and its gradients are accumulated into var_param.grad for var_param in var_params.
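
The flattening step and the var_loss line from the question can be checked on toy values. This is only a sketch with hypothetical gradient tensors, assuming PyTorch 0.4 or later; note that .mean() divides the sum of squares by the number of elements.

```python
import torch

# Hypothetical gradients for two parameters of different shapes.
grads = [torch.tensor([[1.0, 2.0], [3.0, 4.0]]), torch.tensor([5.0, 6.0])]

# Flatten each gradient to 1-D and concatenate into a single vector.
flat = torch.cat([g.view(-1) for g in grads], 0)

# Mean of the squared entries: (1 + 4 + 9 + 16 + 25 + 36) / 6
var_loss = (flat ** 2).mean()

print(flat.shape, var_loss.item())
```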

My understanding of all this is probably way off mark and I still can’t see where some bits of computation graph might be left hanging around.

Doing

for param in cv_params:
    param.detach()

at the end of each iteration might conceivably help, but I really have no idea.
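
One caveat worth checking: detach() returns a new tensor and leaves the original untouched, so calling it without using the result has no effect, while detach_() works in place. The sketch below assumes PyTorch 0.4 or later and uses a made-up parameter w; whether detaching the gradients actually fixes the memory growth in the original loop is not established here.

```python
import torch

w = torch.ones(2, requires_grad=True)  # hypothetical parameter
(g,) = torch.autograd.grad((w ** 3).sum(), [w], create_graph=True)

g.detach()                             # result discarded, g keeps its graph
unchanged = g.grad_fn is not None
print(unchanged)

g_cut = g.detach()                     # new tensor, same data, no graph
print(g_cut.grad_fn is None)

g.detach_()                            # in place: g itself is cut from its graph
print(g.grad_fn is None)
```

Applied to the loop above, the in-place form would be called on each param.grad rather than on the parameter itself.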

So I narrowed down the problem. It seems like there’s a problem when I do the last grad update with inputs var_params, where baseline_params is part of the network and etas are defined as:

for _ in cv_params:
    eta = 2 * F.sigmoid(Variable(torch.zeros(1), requires_grad=True))
    if opts['cuda']:
        eta = 2 * F.sigmoid(Variable(torch.zeros(1), requires_grad=True).cuda())
    eta_params.append(nn.Parameter(eta.data))
    etas.append(eta)
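
One thing to check in this snippet: nn.Parameter(eta.data) creates a new leaf that shares values with eta but is not connected to it in the graph, so a loss computed from eta never touches the Parameter. The sketch below assumes PyTorch 0.4 or later; raw, make_loss and connected_param are hypothetical names, and the last part is only one possible way to keep the Parameter connected, not necessarily the fix used in this thread.

```python
import torch
import torch.nn as nn

raw = torch.zeros(1, requires_grad=True)
eta = 2 * torch.sigmoid(raw)          # eta depends on raw
eta_param = nn.Parameter(eta.data)    # new leaf: same values, no link to eta


def make_loss():
    # Rebuilt on each call so every grad() call gets a fresh graph.
    return (2 * torch.sigmoid(raw) * 3.0).sum()


# Asking for the gradient w.r.t. a tensor the loss never used raises an error.
try:
    torch.autograd.grad(make_loss(), [eta_param])
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print(error_message)

# allow_unused=True returns None for such inputs instead of raising.
(unused_grad,) = torch.autograd.grad(make_loss(), [eta_param], allow_unused=True)
print(unused_grad)

# Building eta from the Parameter itself keeps the two connected.
connected_param = nn.Parameter(torch.zeros(1))
connected_eta = 2 * torch.sigmoid(connected_param)
(connected_grad,) = torch.autograd.grad((connected_eta * 3.0).sum(), [connected_param])
print(connected_grad)
```

With the connected version, eta would need to be recomputed from the Parameter in every iteration so that each step builds a fresh graph.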

I actually found a bug where I was doing this: var_params = baseline_params + etas
instead of var_params = baseline_params + eta_params
But changing to this gives an error: RuntimeError: One of the differentiated Variables appears to not have been used in the graph
If I change the last 2 lines to: