Hi,
I am trying to optimize an MLP using multiple losses that are computed at different times for the same neural network. Basically, the problem is a combination of supervised and semi-supervised learning.

I want to optimize the network using SGD with a label for each training example, and also optimize it using SGD with a label for a set of training examples (e.g. the sum of predictions over a certain period of time has its own label).

dataset = {clip_label: {frame1_label: frame1, frame2_label: frame2, frame3_label: frame3}}
summed_output = 0  # running sum of the per-frame predictions
# unrolled loop
optimizer.zero_grad()
output = net(frame1)
loss = criterion(output, frame1_label)
loss.backward()
optimizer.step()
summed_output += torch.sum(output) # sum predicted output
optimizer.zero_grad()
output = net(frame2)
loss = criterion(output, frame2_label)
loss.backward()
optimizer.step()
summed_output += torch.sum(output) # sum predicted output
optimizer.zero_grad()
output = net(frame3)
loss = criterion(output, frame3_label)
loss.backward()
optimizer.step()
summed_output += torch.sum(output) # sum predicted output
# Optimize at the clip_level
optimizer.zero_grad()
loss = criterion(summed_output , clip_label)
loss.backward()
optimizer.step()

This is what the code should look like. Zeroing the gradients might be problematic for the clip-level optimization. When I run my code, I get this error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [8, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

If I detach summed_output from the graph, then this error disappears, but I do not think the training process is correct.

Any idea how to accomplish this process correctly?

The error happens because optimizer.step() in the first iterations modifies the weights of net in place. The final backward, however, needs the original values to backprop through the original net, hence the error.

The final backward also goes through the same graphs as the earlier ones, so the earlier backward calls need retain_graph=True. Otherwise the last backward will fail, saying that the buffers have already been freed.
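To make the failure concrete, here is a minimal sketch that should reproduce it with a single frame. The two-layer net, the shapes, and the random data are hypothetical stand-ins, not the code from the question. It also reads `_version`, the internal counter autograd uses to detect in-place changes, before and after the step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the MLP in the question.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
criterion = nn.MSELoss()

frame = torch.randn(2, 4)        # made-up input
frame_label = torch.randn(2, 1)  # made-up per-frame label
clip_label = torch.randn(())     # made-up clip-level label

last_weight = net[2].weight
version_before = last_weight._version

optimizer.zero_grad()
output = net(frame)
loss = criterion(output, frame_label)
loss.backward(retain_graph=True)  # keep the buffers for the later backward
optimizer.step()                  # updates the weights in place

version_after = last_weight._version

# The clip-level loss still points at the graph built with the old weights.
summed_output = output.sum()
clip_loss = criterion(summed_output, clip_label)

error_message = None
try:
    clip_loss.backward()
except RuntimeError as err:
    error_message = str(err)

print(version_before, version_after)
print(error_message)
```

The version counter of the last layer's weight changes across the step, and the second backward refuses to run because the graph saved that weight at its earlier version.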

The question I would have is: do you want to compute the gradients for the final loss based on the original parameter values or the updated ones?
If you want the original values, you will have to use a different version of the net, so that the original values are kept around and you can backprop to them.
If you want the gradients with respect to the updated values, you should redo the forward with the final parameter values to compute the summed output at the end.
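For the first option, one possible reading of "a different version of the net" is a frozen copy taken before each step. The sketch below assumes that reading. The `snapshots` list, the way the snapshot gradients are summed and written onto the live net, and all shapes and data are assumptions made for illustration, not something confirmed in this thread.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the MLP and data in the question.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
criterion = nn.MSELoss()

frames = [torch.randn(1, 4) for _ in range(3)]
frame_labels = [torch.randn(1, 1) for _ in range(3)]
clip_label = torch.randn(())

snapshots = []
summed_output = torch.zeros(())
for frame, frame_label in zip(frames, frame_labels):
    # Copy of the weights as they are right now; optimizer.step() never
    # touches this copy, so its graph stays valid.
    snapshot = copy.deepcopy(net)
    snapshot.zero_grad()  # make sure no stale gradients come along
    snapshots.append(snapshot)
    summed_output = summed_output + snapshot(frame).sum()

    # Regular per-frame step on the live net.
    optimizer.zero_grad()
    loss = criterion(net(frame), frame_label)
    loss.backward()
    optimizer.step()

# Clip-level loss; its gradients land on the snapshots' parameters.
clip_loss = criterion(summed_output, clip_label)
clip_loss.backward()

# Assumption: sum the snapshot gradients and apply them to the live net.
optimizer.zero_grad()
snapshot_params = [s.parameters() for s in snapshots]
for param, *old_params in zip(net.parameters(), *snapshot_params):
    param.grad = sum(p.grad for p in old_params)
optimizer.step()
```

This keeps one full copy of the net per frame, so memory grows with the clip length, and the final step applies gradients that were computed for older weight values.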

I am currently using retain_graph and computing the gradient with respect to the last updated weights. The issue is that summed_output needs to be detached from the original graph, since I have already made several optimization steps and zeroed the gradients at each step. What happens now is that I detach it, then compute output and add it to summed_output, so that the backward pass only goes through the last forward pass. This, however, is not what I want, and I do not think it is valid. For the second option, is this how you would do it?

I would like to calculate gradients based off the original parameters (from every iteration). As you noted that I have to use a different version of the net, how can I do that?

You’re suggesting that I append all frames of a clip from each iteration, then run them as a single batch for the clip-level optimization, while still doing per-frame optimization at each iteration.

I think this would work, but it would be slow since we’re doing a forward pass twice for the same frames.

Yes, you will do the forward twice, but you don’t really have a choice since you want gradients for different versions of your parameters.
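A minimal sketch of this double-forward approach, assuming the frames of a clip can be concatenated into one batch. The net, shapes, and random data are hypothetical stand-ins for the ones in the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the MLP and data in the question.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
criterion = nn.MSELoss()

frames = [torch.randn(1, 4) for _ in range(3)]
frame_labels = [torch.randn(1, 1) for _ in range(3)]
clip_label = torch.randn(())

# Per-frame steps. Each graph is used once, so retain_graph is not needed.
for frame, frame_label in zip(frames, frame_labels):
    optimizer.zero_grad()
    loss = criterion(net(frame), frame_label)
    loss.backward()
    optimizer.step()

# Clip-level step: second forward over all frames with the updated weights.
optimizer.zero_grad()
clip_batch = torch.cat(frames, dim=0)  # shape (3, 4)
summed_output = net(clip_batch).sum()  # sum of the predictions over the clip
clip_loss = criterion(summed_output, clip_label)
clip_loss.backward()
optimizer.step()
```

Because the clip-level graph is built after the last per-frame step, nothing it saved is modified before its backward runs, and neither detach nor retain_graph is involved.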

The only other way around it I can think of (in either method) is to save the gradients and do all the updates at the end. But even then, you would use gradients computed for different weight values to update the new ones, which doesn’t really make sense…
You have to redo the forward with the new weights if you want to have the right gradients.