Hi, I'm trying to train an auxiliary task on the original model's mid-block features without influencing the original model's performance. I use detach(), but the original model's performance still decreases.
There are two models:
- the main model
- the auxiliary model that takes the main model's mid-block features
The main network has Block 1 and a fully-connected layer, like below:
input -> B1 -> FC1 -> score
The auxiliary model has only a fully-connected layer:
input (the B1 output) -> FC2
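A minimal sketch of this setup in PyTorch (layer sizes and output dimensions are placeholders I picked just for illustration):

```python
import torch
import torch.nn as nn

class MainModel(nn.Module):
    """Main network: input -> B1 -> FC1 -> score."""
    def __init__(self):
        super().__init__()
        self.b1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # Block 1
        self.fc1 = nn.Linear(64, 10)                           # FC1

    def forward(self, x):
        feats = self.b1(x)              # mid-block features
        return self.fc1(feats), feats   # score + features for the aux head

class AuxModel(nn.Module):
    """Auxiliary head: B1 features -> FC2."""
    def __init__(self):
        super().__init__()
        self.fc2 = nn.Linear(64, 10)

    def forward(self, feats):
        return self.fc2(feats)
```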
I want to train both models simultaneously, but the auxiliary model must not affect the main model.
I have tried many things, including setting each model's param.requires_grad = False, but it still affects the original model's performance.
Could you give me any advice?
You could initialize the second optimizer with only the FC2 parameters, i.e. excluding the main model's parameters. That should mean the second optimizer won't update the weights of B1.
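For example (reusing the MainModel / AuxModel sketch from the question; optimizer choice and learning rate are arbitrary):

```python
import torch

main_model = MainModel()
aux_model = AuxModel()

# Each optimizer only sees its own model's parameters, so
# aux_opt.step() can never update B1 or FC1.
main_opt = torch.optim.Adam(main_model.parameters(), lr=1e-3)
aux_opt = torch.optim.Adam(aux_model.parameters(), lr=1e-3)
```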
You would still need to make sure that, while training the auxiliary model, gradients are not accumulated in B1 during the backward pass. Also, I think that if you turn off gradients in one model, you will turn them off in the second model as well if they share the underlying memory.
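One standard way to block those gradients, which you mention having tried, is to detach the features before feeding them to the auxiliary head; aux_loss.backward() then stops at the detach and never reaches B1. A rough sketch with dummy data (batch shapes and labels are made up):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Dummy batch, just for illustration.
x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))      # main-task labels
y_aux = torch.randint(0, 10, (8,))  # auxiliary-task labels

score, feats = main_model(x)
main_loss = criterion(score, y)

# The aux head sees detached features, so its backward pass
# cannot propagate into B1.
aux_loss = criterion(aux_model(feats.detach()), y_aux)

main_opt.zero_grad()
main_loss.backward()
main_opt.step()

aux_opt.zero_grad()
aux_loss.backward()
aux_opt.step()
```

If this is wired up correctly (detached features plus separate optimizers over disjoint parameter sets), the auxiliary training step cannot change the main model's weights.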
Off the top of my head, a possible solution would be implementing some locking mechanism between the two models, such that when you train the first model, requires_grad is True for its parameters and False for the second model's, and vice versa. There might be easier solutions, though.
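A rough sketch of that locking idea (just one possible way to write it; the actual step bodies are elided):

```python
def set_requires_grad(module, flag):
    # Toggle gradient tracking for every parameter in the module.
    for p in module.parameters():
        p.requires_grad_(flag)

# Before a main-model step: main trainable, aux frozen.
set_requires_grad(main_model, True)
set_requires_grad(aux_model, False)
# ... main forward / backward / main_opt.step() ...

# Before an aux-model step: main frozen, aux trainable.
set_requires_grad(main_model, False)
set_requires_grad(aux_model, True)
# ... aux forward / backward / aux_opt.step() ...
```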