Use only part of the network as a new model for another task

There are two separate tasks:

  1. an auto-encoder for reconstruction
  2. an encoder for regression

For the second task, after the first model is trained, I would like to use its code-layer output as my target; the second model would be identical to the encoder part of the auto-encoder, rather than a literally separate model.

That is, when training the second model:

import torch

# task1: model1 = autoencoder(encoder_only=False) was trained as a full auto-encoder
model1 = torch.load("path/to/trained/autoencoder")  # after task1 training, load the trained auto-encoder
model1.eval()
model1.encoder_only = True  # set the flag so that forward() returns the code-layer output
model1_output = model1(input1)

# task2 training (loss and optimizer2 are defined elsewhere; optimizer2 is built from model2.parameters())
model2 = autoencoder(encoder_only=True)  # second model; its forward() returns the code-layer output
model2_output = model2(input2)
loss2 = loss(model2_output, model1_output)
loss2.backward()
optimizer2.step()

The encoder_only flag tells the forward function whether to return just the code-layer output or the reconstructed input. It doesn’t change the network architecture at all.
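
For context, the model class looks roughly like this (a minimal sketch with made-up layer sizes; the actual architecture differs):

import torch.nn as nn

class autoencoder(nn.Module):
    def __init__(self, encoder_only=False):
        super().__init__()
        self.encoder_only = encoder_only
        # illustrative sizes only
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

    def forward(self, x):
        code = self.encoder(x)          # code-layer output
        if self.encoder_only:
            return code                 # task2: return the code only
        return self.decoder(code)       # task1: return the reconstruction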

My question is: will loss2.step() update the unintended rest of model2’s network (its decoder part), or does loss2 only backpropagate through the network that produced model2_output (the intended encoder part)?

What is loss2.step()? Do you mean stepping with an optimizer?

To prevent gradient computation for model1’s parameters, you can detach model1_output from your computation graph:

loss2 = loss(model2_output, model1_output.detach())
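
Alternatively, you could run model1’s forward pass under torch.no_grad(), so that no graph is built for model1 in the first place:

with torch.no_grad():
    model1_output = model1(input1)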

Yes, you are right, that was a typo. I’ve edited the OP.

I apologize if I was not clear; what I mean is that although model2 is identical to model1, I only care about the weights of its encoder part, and I use the output of the encoder for the loss computation. My question is: does loss2.backward() propagate through the whole network of model2 (encoder + decoder), or only through the part of interest, i.e. the encoder? Likewise, does optimizer2.step() update the weights of the whole network or only those of the encoder?

I added model1.eval() to the snippet, which I believe avoids gradient computation for model1, so detach() is not necessary. Please correct me if I’m wrong.

Where is the decoder? If you don’t use the decoder in model2 to create model2_output or for the creation of input2, loss2.backward() has no way of finding the decoder and updating any gradients.

optimizer2.step() will update weights for whatever Parameters you passed into the optimizer construction. If those Parameters had 0 or None gradients, then those weights will not be updated.
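
For example, if you only ever want the encoder trained in task2, you could construct the optimizer from just those parameters (assuming the encoder is exposed as a submodule, e.g. model2.encoder):

optimizer2 = torch.optim.Adam(model2.encoder.parameters(), lr=1e-3)  # decoder weights are never touched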

I don’t think .eval() guarantees gradient computation is avoided. It depends on how model1 is written.
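
You can check whether autograd is still tracking model1’s forward pass by inspecting the output, e.g.:

model1.eval()
model1_output = model1(input1)
print(model1_output.requires_grad)  # True means the graph is still being recorded for model1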

Thanks for your reply.

Although optimizer2 is constructed with the parameters of model2, which include both an encoder and a decoder, loss2 is computed from the encoder_only forward passes of input1 and input2, i.e. model1_output and model2_output; the decoder is not involved in the forward pass.

So yes, neither model1_output nor model2_output is created by the decoder. The decoder was only involved in training model1 (i.e., task1, whereas I am focusing on task2 in this post).

Therefore, what happens is as you said: "loss2.backward() has no way of finding the decoder and updating any gradients", and so optimizer2.step() doesn’t update the weights of the decoder, correct?

I will play around with .eval() and detach(). If you are correct, loss2.backward() and optimizer2.step() would update the gradients and weights of the encoder in the loaded, trained model1, which is not desired.

Yes, in that case, if model2_output does not depend on the decoder, then the decoder weights will not have gradients (after loss2.backward()), and as a result optimizer2.step() will not update the weights of the decoder (due to the gradients being either 0 or None).
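
You can confirm this empirically by inspecting the gradients after the backward call (with the sketch above, the decoder’s parameter names would contain "decoder"):

loss2.backward()
for name, p in model2.named_parameters():
    print(name, p.grad is None)  # decoder parameters should still print True (no gradient)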