I’m using a seq2seq transformer network in a multi-task learning setting. I have a main text generation task and an auxiliary classification task that uses an intermediate output as its prediction. For both tasks I do a full forward pass, but for the auxiliary task I only use the output of an intermediate layer to compute the loss.
My question is what happens when I do a backward pass on this intermediate loss. I would think the forward pass builds a computational graph for the entire network, but since the loss only uses part of this graph I assume the parameters that come after this intermediate layer will not be affected. Is this reasoning correct?
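A minimal sketch for checking this empirically, assuming PyTorch. The modules `early`, `late`, and `aux_head` are hypothetical stand-ins for the layers before the intermediate output, the layers after it, and the auxiliary classifier; they are not the actual transformer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the real network (assumption, not the actual model).
early = nn.Linear(8, 8)     # layers before the intermediate output
late = nn.Linear(8, 8)      # layers after the intermediate output
aux_head = nn.Linear(8, 3)  # auxiliary classifier on the intermediate output

x = torch.randn(4, 8)
labels = torch.tensor([0, 1, 2, 1])

# Full forward pass, as described in the question.
hidden = torch.relu(early(x))  # intermediate output
final = late(hidden)           # computed, but not used by the auxiliary loss

# Backward pass on the intermediate (auxiliary) loss only.
aux_loss = nn.functional.cross_entropy(aux_head(hidden), labels)
aux_loss.backward()

# Parameters upstream of the loss receive gradients; the later ones do not.
early_has_grad = all(p.grad is not None for p in early.parameters())
late_has_grad = any(p.grad is not None for p in late.parameters())
print("early layers have grads:", early_has_grad)
print("late layers have grads:", late_has_grad)
```

In this sketch the later layer's `.grad` stays `None`, so an optimizer step would skip those parameters. One caveat: if gradients from an earlier main-task backward pass are still stored, or if gradients were reset to zero tensors and the optimizer applies weight decay or momentum, the later parameters can still change.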
Thanks for your answer. However, the crux of my question is whether using intermediate network outputs to compute a loss and perform backprop, while doing a full forward pass, will or will not update the parameters that only influence outputs after my intermediate loss.
Hi,
I don’t think these two expressions are the same. Because the loss is just a number, when you backpropagate the sum of these two losses, the auxiliary loss will also propagate into the entire network, which is not what Joris intended. So I think the second one makes more sense for the problem discussed.
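One way to check how a summed loss distributes its gradients is to compare per-parameter gradients directly. The sketch below assumes PyTorch and uses hypothetical stand-in modules (`early`, `late`, `aux_head`), with an MSE loss standing in for the main generation loss; it is an illustration, not the model from this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins (assumptions for illustration only).
early = nn.Linear(8, 8)     # layers before the intermediate output
late = nn.Linear(8, 8)      # layers after the intermediate output
aux_head = nn.Linear(8, 3)  # auxiliary classifier

x = torch.randn(4, 8)
target = torch.randn(4, 8)
labels = torch.tensor([0, 1, 2, 1])

hidden = torch.relu(early(x))
main_loss = nn.functional.mse_loss(late(hidden), target)  # stand-in main loss
aux_loss = nn.functional.cross_entropy(aux_head(hidden), labels)

# Gradients of each loss on its own.
g_late_main, g_early_main = torch.autograd.grad(
    main_loss, [late.weight, early.weight], retain_graph=True
)
(g_early_aux,) = torch.autograd.grad(aux_loss, [early.weight], retain_graph=True)

# Gradients of the summed loss.
g_late_sum, g_early_sum = torch.autograd.grad(
    main_loss + aux_loss, [late.weight, early.weight]
)

# Later layer: the summed-loss gradient matches the main-loss gradient.
late_same = torch.allclose(g_late_sum, g_late_main, atol=1e-6)
# Earlier layer: the summed-loss gradient is the sum of both contributions.
early_is_sum = torch.allclose(g_early_sum, g_early_main + g_early_aux, atol=1e-6)
print(late_same, early_is_sum)
```

Under these assumptions, the auxiliary term changes the gradients only for parameters that feed into the intermediate output; parameters after that point receive the main-loss gradient alone.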