I’m splitting my network into two parts, top and bottom. Let’s call the resulting activation map v=bottom(x). I would like to optimize over L=top(v) treating the activation map v as weights. By default PyTorch does not retain gradients for activation maps, so I used v.retain_grad() in order to preserve the gradient. However when I optimize over the top part of the network I get the error: can’t optimize a non-leaf variable.

However, this is exactly what I would like to do. Why is PyTorch stopping me, and what can I do about it?

v = bottom(x)
v.retain_grad()
opt = t.optim.SGD([v], 1e-3)
for i in range(10):
    opt.zero_grad()
    top(v).backward([v])
    opt.step()

(I’m only optimizing over v here, nothing else.)

This kind of setup produces the error "can’t optimize a non-leaf variable", which is accurate, since v is an intermediate result connecting the bottom module to the top module. However, I do not see why that is a valid reason for not being allowed to optimize over v.

Please take a look at how the optim package should be used in the doc here.
The optimizer should be given parameters, not the output of your network.

If your net is made of top and bottom and you want to optimize both of them:

# Outside training loop
import itertools
optimizer = t.optim.SGD(itertools.chain(top.parameters(), bottom.parameters()), ...)
# Inside training loop
v = bottom(x)
out = top(v)
loss = criterion(out, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

If you want to optimize just top:

# Outside training loop
optimizer = t.optim.SGD(top.parameters(), ...)
# Inside training loop
v = bottom(x)
v = v.detach()  # detach() is not in-place; rebind v so nothing backpropagates to bottom
out = top(v)
loss = criterion(out, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
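
As a small illustration of the detach step above: `detach()` returns a new tensor and leaves the original attached to the graph, so the result has to be assigned back. The sketch below uses two hypothetical `nn.Linear` layers as stand-ins for bottom and top and a made-up squared-output loss; none of these come from the thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two halves of the network.
bottom = nn.Linear(4, 3)
top = nn.Linear(3, 1)
x = torch.randn(8, 4)

v = bottom(x)
v.detach()                          # result discarded: v itself is unchanged
still_attached = v.requires_grad    # v is still connected to bottom

v = v.detach()                      # rebind: v no longer leads back to bottom
loss = top(v).pow(2).mean()         # illustrative loss, not from the thread
loss.backward()

# Only top received gradients; bottom was cut off from the graph.
bottom_has_grad = any(p.grad is not None for p in bottom.parameters())
top_has_grad = all(p.grad is not None for p in top.parameters())
print(still_attached, bottom_has_grad, top_has_grad)
```

Computing `v` under `torch.no_grad()` would be another way to get the same effect, and it also avoids building the graph for bottom in the first place.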

If you want to optimize just bottom:

# Outside training loop
optimizer = t.optim.SGD(bottom.parameters(), ...)
# Inside training loop
# You need to backprop all the way, so you need to do the full forward
v = bottom(x)
out = top(v)
loss = criterion(out, target)
optimizer.zero_grad()
loss.backward()
# Since the optimizer has only the bottom parameters, the top net will not be changed
optimizer.step()
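
To make the last comment concrete, here is a minimal sketch that checks which parameters move after one step when the optimizer only holds the bottom parameters. The layer sizes, the random data, and the MSE criterion are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two halves of the network.
bottom = nn.Linear(4, 3)
top = nn.Linear(3, 1)
x = torch.randn(8, 4)
target = torch.randn(8, 1)

optimizer = torch.optim.SGD(bottom.parameters(), lr=0.1)

# Snapshot the weights so we can compare after the update.
top_before = [p.detach().clone() for p in top.parameters()]
bottom_before = [p.detach().clone() for p in bottom.parameters()]

loss = nn.functional.mse_loss(top(bottom(x)), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

top_unchanged = all(torch.equal(a, b) for a, b in zip(top_before, top.parameters()))
bottom_changed = any(not torch.equal(a, b) for a, b in zip(bottom_before, bottom.parameters()))
# top's parameters still get .grad filled in, even though nothing steps them.
top_grads_filled = all(p.grad is not None for p in top.parameters())
print(top_unchanged, bottom_changed, top_grads_filled)
```

Note that `optimizer.zero_grad()` only clears the gradients of the parameters it owns, so the gradients on top keep accumulating across iterations. If that matters, calling `top.requires_grad_(False)` before training is one option.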

Does this help? Or is your use case something else?

Hey albanD, this is not what I’m looking for. I would like to optimize over the output of my bottom network, and this is part of my specific use case. I have found a way around the problem by wrapping v.data in a new Variable, which PyTorch then considers a leaf node:

v_leaf = t.autograd.Variable(v.data, requires_grad=True)
opt = t.optim.SGD([v_leaf], 1e-3)
for i in range(10):
    opt.zero_grad()
    top(v_leaf).backward([v_leaf])  # use the leaf copy that the optimizer holds
    opt.step()

Although this works, I don’t see any reason why I need this workaround. One would think you can optimize over any variable as long as it stores its gradient. Yes, it is a bit weird to optimize over non-leaf nodes, but there are use cases out there where this may be useful.

The reason this restriction exists is that a non-leaf Variable is a tensor holding a temporary computation between the leaf Variables and the final Variable from which you are going to backpropagate. Since these are intermediate results, it makes no sense to change them with an optimizer: you would end up with an invalid computational graph, because you modified values in the middle of the graph, and calling backward on that graph would then return wrong values.
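
The leaf / non-leaf distinction can be inspected directly through the `is_leaf` and `grad_fn` attributes. A minimal sketch, assuming a hypothetical `nn.Linear` as bottom:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the bottom half of the network.
bottom = nn.Linear(4, 3)
x = torch.randn(2, 4)

v = bottom(x)

# v was produced by an operation, so it records how it was computed.
v_is_leaf = v.is_leaf
v_has_grad_fn = v.grad_fn is not None

# Parameters are created by the user (here, by the module), so they are leaves.
weight_is_leaf = bottom.weight.is_leaf

# A detached copy has no history, so it is a leaf again.
v_copy = v.detach().requires_grad_(True)
copy_is_leaf = v_copy.is_leaf
copy_has_grad_fn = v_copy.grad_fn is not None

print(v_is_leaf, v_has_grad_fn, weight_is_leaf, copy_is_leaf, copy_has_grad_fn)
```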

If you want to optimize a given element, it needs to be a leaf Variable. The way you do it is right.
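
For reference, the same leaf-copy idea can be sketched with the current tensor API, where `Variable` is no longer needed. The layers, the input, and the squared-output objective below are assumptions made for illustration; the thread does not say what loss top actually produces.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two halves of the network.
bottom = nn.Linear(4, 3)
top = nn.Linear(3, 1)
x = torch.randn(8, 4)

# Compute the activation once, without recording a graph through bottom.
with torch.no_grad():
    v0 = bottom(x)

# Fresh leaf tensor holding the activation values; this is what gets optimized.
v_leaf = v0.clone().requires_grad_(True)
start = v_leaf.detach().clone()

opt = torch.optim.SGD([v_leaf], lr=1e-3)
losses = []
for _ in range(10):
    opt.zero_grad()
    loss = top(v_leaf).pow(2).sum()   # assumed scalar objective
    loss.backward()
    opt.step()                        # updates v_leaf only; top is not in the optimizer
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Because `v_leaf` is a copy, the updated values are not fed back into bottom; the graph through bottom is deliberately left out.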
One thing to check: you call top(v).backward([v]). This backpropagates from the output of top using the contents of v as the initial gradients. Is that expected? If this is actually the intended behavior and you are not doing double backprop, you can speed it up by passing v.detach() to the backward function instead, because right now, since v.requires_grad=True, it creates the full graph for the double backward.
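
To illustrate what the argument to `backward` does: the tensor passed in is used as the starting gradient for the output, so `out.backward(g)` gives the same result as backpropagating the scalar `(out * g).sum()`. The sketch below assumes a hypothetical `top` whose output has the same shape as `v`, which the seeded call requires.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical top whose output shape matches v (needed to seed backward with v).
top = nn.Linear(3, 3)
v = torch.randn(5, 3, requires_grad=True)

# Seed the backward pass with the current values of v, detached from the graph.
out = top(v)
out.backward(v.detach())
grad_seeded = v.grad.clone()

# Equivalent formulation: weight the output by the seed and reduce to a scalar.
v.grad = None
(top(v) * v.detach()).sum().backward()
grad_explicit = v.grad.clone()

print(torch.allclose(grad_seeded, grad_explicit))
```

If the goal is simply to minimize a scalar loss, calling `backward()` with no argument on that scalar is the usual form.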