# How to change the loss function for different layers

Hi,

I am trying to recreate this paper: Continuous Learning in Single Incremental Tasks

One of the algorithms in the paper involves training the final layer w.r.t. a loss function (e.g. cross-entropy), but training all preceding layers with a regularization term added to the back-propagated loss. As I understand it, adding a regularization term for the entire network is done by adding the term to the loss value before the backward/step:

```python
loss += regularization_term
loss.backward()
opt.step()
```

But according to my understanding that modifies the loss for the entire network, so I am stuck on how to change the loss only for the layers preceding the final one.


Not necessarily. The regularization term will only modify the parameters which are in its computation graph.
Have a look at this dummy example:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1, 1, bias=False),
    nn.Sigmoid(),
    nn.Linear(1, 1, bias=False)
)

criterion = nn.MSELoss()
x = torch.randn(1, 1)
target = torch.ones(1, 1)

output = model(x)
loss = criterion(output, target)
loss.backward()

print('Before regularization')
print(model[0].weight.grad)
print(model[2].weight.grad)

model.zero_grad()  # clear the gradients so the two runs are comparable
output = model(x)
loss = criterion(output, target)
loss = loss + torch.norm(model[0].weight)  # regularize only the first layer's weight
loss.backward()

print('After regularization')
print(model[0].weight.grad)
print(model[2].weight.grad)
```
As you can see, the gradient of the second linear layer (`model[2]`) stays the same, while the gradient of the first one (`model[0]`) changes.
This is due to the fact that the parameters of `model[2]` were not involved in creating the regularization term, so they won't be touched by it.
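For the original question (regularizing every layer except the final one), the same idea scales up: build the regularization term only from the parameters of the preceding layers. Here is a minimal sketch, assuming the final layer is the last child of an `nn.Sequential` (the model, penalty weight, and L2 penalty are placeholders, not from the paper):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2)  # final layer: excluded from the regularization term
)
criterion = nn.CrossEntropyLoss()

x = torch.randn(3, 4)
target = torch.randint(0, 2, (3,))

# Collect the parameters of every child module except the last one.
reg_params = [p for m in list(model.children())[:-1] for p in m.parameters()]

# Simple L2 penalty as a stand-in for the paper's regularization term.
reg_term = sum(p.pow(2).sum() for p in reg_params)

model.zero_grad()
loss = criterion(model(x), target) + 1e-3 * reg_term
loss.backward()
```

Since `reg_term` does not involve the final layer's parameters, their gradients are exactly the plain cross-entropy gradients, while all earlier layers also receive the gradient of the penalty.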