Multi-loss optimizer

Hi there.

I have a question for the masters.
Please give me hope.

My model has many layers.
I compute total_loss as the sum of the loss on the final output and the loss on the output of a specific layer.
I then backpropagate total_loss, and of course there is only one optimizer.

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # many layers
        # many layers

    def forward(self, x):
        output1 = self.layer1(x)
        output_s = self.specific_layer(output1.detach())
        outputN = self.layerN(self.layerN_1(torch.cat([output1, self.layerN_2(...), .....])))
        final_output = self.final_layer(outputN)
        return final_output, output_s

model = MyModel()
optimizer = Adam(model.parameters(), ...)

  final_y_pred, specific_y_pred = model(x)

  final_loss = criterion(final_y_pred, final_target)
  specific_loss = criterion(specific_y_pred, specific_target)
  total_loss = final_loss + specific_loss

  optimizer.zero_grad()
  total_loss.backward()
  optimizer.step()

The total_loss converges, but the loss of the specific layer oscillates. T.T

I thought of two ways…

The first is to apply a different learning rate per parameter group (a smaller one for the oscillating specific layer) using "per-parameter options".
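Per-parameter options would look roughly like this; the submodule names (backbone, specific_layer), sizes, and learning rates below are placeholders for your actual model, not the real architecture:

```python
import torch
import torch.nn as nn

# Toy stand-in for MyModel; the submodule names and shapes are assumptions.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)        # stands in for the big branch
        self.specific_layer = nn.Linear(8, 4)  # the oscillating branch

model = ToyModel()

# Per-parameter options: the specific_layer group overrides the default lr.
optimizer = torch.optim.Adam(
    [
        {"params": model.backbone.parameters()},                   # uses lr=1e-3
        {"params": model.specific_layer.parameters(), "lr": 1e-4},
    ],
    lr=1e-3,
)

print([g["lr"] for g in optimizer.param_groups])  # [0.001, 0.0001]
```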

The second method is to train with a separate optimizer for each part:

  1. Remove the specific_layer from MyModel and return output1 and final_output.
  2. Move the separated specific_layer into MyModel2, which takes output1 as input and returns output_s.
  3. Create an optimizer for MyModel and MyModel2, respectively.
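The steps above could be sketched like this (the layer shapes and names are illustrative assumptions, not your real architecture; the detach on output1 is kept as in your pseudo code):

```python
import torch
import torch.nn as nn

# Main network: returns both the final prediction and the intermediate output1.
class MainModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 8)
        self.final = nn.Linear(8, 4)

    def forward(self, x):
        output1 = self.layer1(x)
        return self.final(output1), output1

# Separated specific head with its own optimizer.
class SpecificModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm_fc = nn.Linear(8, 2)  # stands in for the LSTM + FC pair

    def forward(self, output1):
        return self.lstm_fc(output1)

main_model = MainModel()
specific_model = SpecificModel()
opt_main = torch.optim.Adam(main_model.parameters(), lr=1e-3)
opt_spec = torch.optim.Adam(specific_model.parameters(), lr=1e-4)

criterion = nn.MSELoss()
x = torch.randn(16, 8)

final_pred, output1 = main_model(x)
spec_pred = specific_model(output1.detach())  # detach kept as before

final_loss = criterion(final_pred, torch.randn(16, 4))
spec_loss = criterion(spec_pred, torch.randn(16, 2))

# The two graphs are independent (thanks to the detach), so each loss
# is backpropagated and stepped separately.
opt_main.zero_grad()
opt_spec.zero_grad()
final_loss.backward()
spec_loss.backward()
opt_main.step()
opt_spec.step()
```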

Which is a better way?
And is there another good way?

Thank you in advance.

Based on the posted pseudo code it seems that specific_loss would create the gradients for self.specific_layer only (layer1 and the previous layers are detached and no other layers are used to create output_s), while final_output might potentially use many more layers (and thus also parameters).
If so, it could be easier for the model to drive final_loss down as its optimizer would potentially update many more parameters, but that’s just my guess.
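A quick way to confirm the detach behavior: after backpropagating only through output_s, layer1 receives no gradient because the graph is cut at the detach. (Toy shapes below are made up.)

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(4, 4)
specific_layer = nn.Linear(4, 2)

x = torch.randn(3, 4)
output1 = layer1(x)
output_s = specific_layer(output1.detach())  # detach cuts the graph here
output_s.sum().backward()

print(layer1.weight.grad)                      # None: nothing flows past the detach
print(specific_layer.weight.grad is not None)  # True
```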

Hi @ptrblck
Thanks for the reply.

You’re right!
final_output uses many, many more layers…
specific_layer has only two layers (an LSTM and an FC), while the rest consists of multiple CNNs, LSTMs, FCs, …

What should I do in this case?

I’m unsure if it would work, but you could try to scale up the specific_loss to try to force the model to focus more on it.
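That scaling would just be a weighted sum; lambda_s below is a hypothetical hyperparameter to tune (the tensors are dummies standing in for your predictions and targets):

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()
final_pred = torch.randn(4, 2, requires_grad=True)     # dummy final prediction
specific_pred = torch.randn(4, 2, requires_grad=True)  # dummy specific prediction
final_loss = criterion(final_pred, torch.randn(4, 2))
specific_loss = criterion(specific_pred, torch.randn(4, 2))

# lambda_s is an assumed scale factor; try a few values (e.g. 5 to 100) and
# watch whether specific_loss stops oscillating.
lambda_s = 10.0
total_loss = final_loss + lambda_s * specific_loss
total_loss.backward()
```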