Training additional layer without hurting original model performance

Hi, I am trying to train an auxiliary task on the original model’s mid-block features without influencing the original model’s performance. I use detach(), however the original model’s performance still decreases.

There are two models:

  1. the main model
  2. the auxiliary model that takes the main model’s mid block as input

The main network has a block (B1) and a fully-connected layer, like below:
input -> B1 -> FC1 -> score
The auxiliary model has only a fully-connected layer:
input (the B1 output) -> FC2

I want to train both models simultaneously, but the auxiliary model must not affect the main model.
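Roughly, the two models look like this (the layer types and sizes are just placeholders for the real ones, and I’m leaving the loss computation out of the sketch):

import torch
import torch.nn as nn

class MainModel(nn.Module):
    def __init__(self):
        super().__init__()
        # B1: the mid block whose output is reused by the auxiliary model
        self.B1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.FC1 = nn.Linear(64, 10)  # produces the score

    def forward(self, x):
        b1 = self.B1(x)
        score = self.FC1(b1)
        return score, b1

class AuxModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.FC2 = nn.Linear(64, 10)  # operates on the B1 output

    def forward(self, b1):
        return self.FC2(b1)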

Here’s my training code:

Two models:

MainModel / AuxModel

Two optimizers:

# each optimizer only gets its own model's parameters
opM = optimizer(MainModel.parameters())
opA = optimizer(AuxModel.parameters())

Training the main model:

MainModel.zero_grad()
# the forward pass returns the main loss and the mid-block output B1
mainLoss, B1 = MainModel(input)
mainLoss.backward()
opM.step()

Training the aux model:

AuxModel.zero_grad()
B1 = B1.detach()  # cut the graph so auxLoss.backward() cannot reach MainModel
prediction = AuxModel(B1)
auxLoss = criterion(prediction, target)  # target: labels for the auxiliary task
auxLoss.backward()
opA.step()

I have tried many things, including setting each model’s param.requires_grad = False, but it still affects the original model’s performance.
Could you give me any advice?

If you are using MainModel only to get the mid block, it’s not necessary to train it (if I understand the question correctly), so you can do:

opA = optimizer(AuxModel.parameters())

with torch.no_grad():
    _, B1 = MainModel(input)

opA.zero_grad()
B1 = B1.detach()  # not strictly needed: B1 already has no graph because of no_grad()
prediction = AuxModel(B1)
auxLoss = criterion(prediction, target)
auxLoss.backward()
opA.step()

Thank you for the reply!
I need to train the main model from scratch.
My question is about the difference in performance between:

  1. training only main_model
  2. training main_model and the aux model at the same time

You could initialize the second optimizer with only the FC2 parameters, i.e. excluding the main model’s parameters. This means the second optimizer won’t update the weights of B1.
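For example (the optimizer type and learning rate here are just placeholders):

import torch

# opM updates the main model; opA only sees the FC2 parameters,
# so opA.step() can never modify the B1 weights
opM = torch.optim.SGD(MainModel.parameters(), lr=0.01)
opA = torch.optim.SGD(AuxModel.parameters(), lr=0.01)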

You would still need to make sure that, during training of the auxiliary model, gradients are not stored in B1 during the backward pass, and I think if you turn off the gradients in one model you will turn them off in the second model as well, if they share underlying memory. A quick way to check the isolation is sketched below.
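As a sanity check (variable names reuse the snippets above), you can verify that calling auxLoss.backward() leaves the main model’s gradients untouched:

# snapshot the main model's gradients before the auxiliary backward pass
before = [p.grad.clone() if p.grad is not None else None
          for p in MainModel.parameters()]
auxLoss.backward()
# because B1 was detached, nothing should have changed in MainModel
for p, g in zip(MainModel.parameters(), before):
    assert (p.grad is None and g is None) or torch.equal(p.grad, g)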

Off the top of my head, a possible solution would be implementing some locking mechanism between the two models, such that when you train the first model, requires_grad is True for it and False for the second model, and vice versa. There might be easier solutions though; a rough sketch is below.
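Something like this, reusing the names from the snippets above (target stands in for the auxiliary task’s labels):

def set_requires_grad(model, flag):
    for p in model.parameters():
        p.requires_grad_(flag)

# main step: main model trainable, auxiliary head frozen
set_requires_grad(MainModel, True)
set_requires_grad(AuxModel, False)
opM.zero_grad()
mainLoss, B1 = MainModel(input)
mainLoss.backward()
opM.step()

# aux step: main model frozen, only FC2 is trained on detached features
set_requires_grad(MainModel, False)
set_requires_grad(AuxModel, True)
opA.zero_grad()
prediction = AuxModel(B1.detach())
auxLoss = criterion(prediction, target)
auxLoss.backward()
opA.step()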