Summing the losses from different models produces different results

Assume two independent models, model0 and model1. When I train these two models individually with their losses loss0 and loss1 respectively, I get different results (accuracy) compared to when I train them together with the combined loss loss = loss0 + loss1.
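To make the setup concrete, here is a minimal sketch of the two regimes (the tiny linear models and random data are just placeholders for my actual models):

import torch
import torch.nn as nn

torch.manual_seed(0)
model0 = nn.Linear(10, 2)
model1 = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
x = torch.randn(16, 10)
target = torch.randint(0, 2, (16,))

# regime 1: backward on each loss individually
loss0 = criterion(model0(x), target)
loss1 = criterion(model1(x), target)
loss0.backward()
loss1.backward()

model0.zero_grad()
model1.zero_grad()

# regime 2: backward once on the summed loss
loss = criterion(model0(x), target) + criterion(model1(x), target)
loss.backward()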
I'd appreciate any thoughts on this problem.
Cheers

I think that's how it is supposed to work, no?
For instance, softmax and triplet losses are combined this way too.

Could you please elaborate? Since grad(loss) = grad(loss0) + grad(loss1) and the two models have no parameters in common, each model should receive the same gradients as when I trained them individually.
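To spell the step out: loss0 depends only on model0's parameters \theta_0 and loss1 only on \theta_1, so the gradient of the sum splits cleanly:

\[
\frac{\partial(\mathrm{loss}_0 + \mathrm{loss}_1)}{\partial \theta_0} = \frac{\partial \mathrm{loss}_0}{\partial \theta_0},
\qquad
\frac{\partial(\mathrm{loss}_0 + \mathrm{loss}_1)}{\partial \theta_1} = \frac{\partial \mathrm{loss}_1}{\partial \theta_1}.
\]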

I again emphasize that the only difference is that I sum the losses and backprop on the summed loss.

I see. I missed that they are two independent models.

How do you initialize the optimizers?
Do you use the same optimizer for both?

The parameters of the two models are put into a single list, and one optimizer is set up from that list. So I assume the optimizer works the same way for both models.
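Roughly like this (a minimal sketch with placeholder models; the lr and momentum values are just examples):

import torch
import torch.nn as nn

model0 = nn.Linear(10, 2)
model1 = nn.Linear(10, 2)
# one optimizer driving the concatenated parameter lists of both models
params = list(model0.parameters()) + list(model1.parameters())
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)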

As you said here, the gradients are independent, I suppose.
So I am not sure how to debug this. I do not know whether advanced optimizers (like Adam) treat the parameters as independent entities or whether they perform some operation that normalizes the gradients across them.
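One check that comes to mind (a hypothetical sketch with two small linear models): take a single Adam step with two separate optimizers and with one joint optimizer, starting from identical seeds, and compare the resulting parameters:

import torch
import torch.nn as nn

def make_models():
    # fixed seed so both runs start from identical weights
    torch.manual_seed(0)
    return nn.Linear(4, 2), nn.Linear(4, 2)

torch.manual_seed(1)
x = torch.randn(8, 4)
target = torch.randint(0, 2, (8,))
criterion = nn.CrossEntropyLoss()

# run 1: separate optimizers, separate losses
m0, m1 = make_models()
opt0 = torch.optim.Adam(m0.parameters(), lr=1e-3)
opt1 = torch.optim.Adam(m1.parameters(), lr=1e-3)
criterion(m0(x), target).backward()
criterion(m1(x), target).backward()
opt0.step()
opt1.step()
ref = [p.detach().clone() for p in list(m0.parameters()) + list(m1.parameters())]

# run 2: one optimizer over both parameter lists, summed loss
m0, m1 = make_models()
opt = torch.optim.Adam(list(m0.parameters()) + list(m1.parameters()), lr=1e-3)
(criterion(m0(x), target) + criterion(m1(x), target)).backward()
opt.step()

for p_ref, p in zip(ref, list(m0.parameters()) + list(m1.parameters())):
    print((p_ref - p).abs().max())  # 0.0 everywhere if the updates match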

Did you try with simple SGD?

What I used was SGD with a decay rate and momentum, though the idea of normalization seems interesting to me too. But it is not clear when that would happen. I suspect it may also be related to the number of parameters, since in the summed case the optimizer sees twice as many parameters as in each individual training run.

I am not sure about the number-of-parameters point. Do you have a concrete reason behind this thought? Is it related to overfitting?

Since the gradients are independent, I would expect them to behave more or less the same in both cases.
I hope there are no implementation-related issues.

@ptrblck I was hoping to get some leads from the PyTorch community about this issue.

Does using np.float32 rather than np.int64 for the labels lead to a degradation in accuracy?

If the losses are independent and do not share any of the parameters used to compute them, you should get the same results.
Are you able to reproduce this issue deterministically by seeding the code?
If so, could you share a code snippet so that we could have a look?
Here is a small example, which yields exactly the same gradients:

import torch
import torch.nn as nn
import torchvision.models as models

# separate losses
torch.manual_seed(2809)

model1 = models.resnet18()
model2 = models.resnet34()

x = torch.randn(2, 3, 224, 224)
target = torch.tensor([0, 999])
criterion = nn.CrossEntropyLoss()

output1 = model1(x)
output2 = model2(x)
loss1 = criterion(output1, target)
loss2 = criterion(output2, target)

loss1.backward()
loss2.backward()
grad1_ref = {name: p.grad.clone() for name, p in model1.named_parameters()}
grad2_ref = {name: p.grad.clone() for name, p in model2.named_parameters()}

model1.zero_grad()
model2.zero_grad()

# loss sum
torch.manual_seed(2809)

model1 = models.resnet18()
model2 = models.resnet34()

x = torch.randn(2, 3, 224, 224)
target = torch.tensor([0, 999])
criterion = nn.CrossEntropyLoss()

output1 = model1(x)
output2 = model2(x)
loss1 = criterion(output1, target)
loss2 = criterion(output2, target)

loss = loss1 + loss2
loss.backward()
grad1 = {name: p.grad.clone() for name, p in model1.named_parameters()}
grad2 = {name: p.grad.clone() for name, p in model2.named_parameters()}

# compare each model's gradients under its own parameter names
for name in grad1_ref:
    print('model1 grad diff for {}: {}'.format(
        name, (grad1_ref[name] - grad1[name]).abs().max()))
for name in grad2_ref:
    print('model2 grad diff for {}: {}'.format(
        name, (grad2_ref[name] - grad2[name]).abs().max()))