Using multiple models and optimizers in training


I’m now trying to use multiple models and optimizers in training.
During the training, it seems like the loss is not decreasing.

Does anyone know what the problem might be?
The code is below.

I have multiple models and optimizers. In the for c in range(classnum): loop, I compute the loss from each model and update that model with its corresponding optimizer.

Note that each optimizer uses a different algorithm.

models = [model1, model2, ..., model10]
optimizers = [optimizer1, optimizer2, ..., optimizer10]

for model in models:
    for step, inputs in enumerate(tqdm(train_loader)):
        inputs =
        for c in range(classnum):
            # Call the module directly rather than .forward() so hooks still run
            outputs, mean, logvar = models[c](each_input)
            kld, recon = loss_function(outputs, each_input, mean, logvar, each_weights)
            loss = kld + recon

            running_loss += loss.item()

Thank you.


This approach looks good to me. You should not have any problem with it.
Does training a single model with your loss work?
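For reference, here is a minimal sketch of the one-optimizer-per-model pattern. It is not your actual code: the tiny models, the synthetic batch, the learning rates, and the `history` list are all stand-ins I made up so the snippet runs on its own, and a plain BCE reconstruction loss replaces your `loss_function`. The point is only that each optimizer holds the parameters of its own model and that `zero_grad()`, `backward()`, and `step()` are called once per model inside the inner loop.

```python
import torch
from torch import nn

torch.manual_seed(0)

classnum = 3  # the thread uses 10; kept small so the sketch runs quickly

# Hypothetical stand-in models: one tiny network per channel.
models = [
    nn.Sequential(nn.Flatten(), nn.Linear(16, 16), nn.Sigmoid())
    for _ in range(classnum)
]

# One optimizer per model, each built from that model's parameters only.
# The algorithms and learning rates are illustrative assumptions.
optimizers = [
    torch.optim.Adam(models[0].parameters(), lr=1e-2),
    torch.optim.SGD(models[1].parameters(), lr=1e-1),
    torch.optim.RMSprop(models[2].parameters(), lr=1e-3),
]

criterion = nn.BCELoss(reduction="mean")

# Synthetic batch shaped (batch, C, H, W) with values in [0, 1].
inputs = torch.rand(8, classnum, 4, 4)

history = [[] for _ in range(classnum)]  # loss per step, per model
for step in range(200):
    for c in range(classnum):
        each_input = inputs[:, c:c + 1]    # keep the channel dim: (8, 1, 4, 4)
        target = each_input.flatten(1)     # (8, 16), matches the model output

        optimizers[c].zero_grad()          # clear gradients of model c only
        outputs = models[c](each_input)
        loss = criterion(outputs, target)
        loss.backward()                    # gradients flow into model c only
        optimizers[c].step()               # update model c only

        history[c].append(loss.item())

for c in range(classnum):
    print(f"model {c}: first loss {history[c][0]:.4f}, last loss {history[c][-1]:.4f}")
```

If a loop like this trains but your real one does not, the difference is probably in the loss or the data rather than in the use of several optimizers.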

Thank you for your reply.

Are you asking whether the training loss decreased with a single model and a single optimizer?

Yes, it did.

When I looked at the training loss graph, the loss appeared to be decreasing.


However, the change in the loss from one epoch to the next is very small…

[Training loss for epochs 1 to 5]


So it is training, just very slowly? Have you tried a different learning rate? What about only 2 models?
I don’t know what your loss is, but is it properly averaged over the samples?

I’m using Adam as the optimizer and have tried several learning rates, but nothing really changes. I haven’t tried training with 2 models yet.

The loss function is torch.nn.BCELoss, and the loss is large because reduction is set to sum.
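To show how much the reduction setting changes the scale, here is a small sketch comparing reduction="sum" and reduction="mean" on random tensors with the same shape as one of my per-channel inputs. The `pred` and `target` tensors are made-up data, not my real outputs, and `per_sample` is just one possible way to normalize by the batch size.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Random stand-ins shaped like one channel of the batch: (32, 1, 100, 100).
pred = torch.rand(32, 1, 100, 100).clamp(1e-4, 1 - 1e-4)
target = torch.rand(32, 1, 100, 100)

loss_sum = nn.BCELoss(reduction="sum")(pred, target)
loss_mean = nn.BCELoss(reduction="mean")(pred, target)

# "sum" adds up every element, so it is numel() times the "mean" value.
print("elements per batch:", pred.numel())
print("sum :", loss_sum.item())
print("mean:", loss_mean.item())

# Summed over pixels but averaged over the batch dimension.
per_sample = loss_sum / pred.size(0)
print("per-sample sum:", per_sample.item())
```

With 32 x 1 x 100 x 100 elements, the summed loss is 320,000 times the mean loss, so a small relative change between epochs can be hard to see on the raw numbers.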

The original data has shape (batch size, C, H, W) = (32, 10, 100, 100). During training, I train each channel with an independent model and optimizer, which means the input to each model has shape (32, 1, 100, 100).

Also, the original data is the output of a segmentation model, so each channel holds probabilities.
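The per-channel split I described looks roughly like the sketch below. The `batch` tensor is random data standing in for the segmentation output, and the variable names are only for illustration.

```python
import torch

# Random stand-in for the segmentation output: (batch, C, H, W).
batch = torch.rand(32, 10, 100, 100)

# Slicing with c:c + 1 keeps the channel dimension, giving (32, 1, 100, 100).
per_channel = [batch[:, c:c + 1] for c in range(batch.size(1))]

# Indexing with a plain integer drops the channel dimension instead.
dropped = batch[:, 0]

# torch.split with chunk size 1 along dim=1 yields the same pieces.
split_channels = torch.split(batch, 1, dim=1)

print(len(per_channel), tuple(per_channel[0].shape), tuple(dropped.shape))
```

Each entry of `per_channel` is then fed to the model with the same index.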