Summing the losses from different models produces different results

Assume two independent models, model0 and model1. When I train these two models individually with their losses loss0 and loss1 respectively, I get different results (accuracy) compared to when I train them together with the combined loss loss = loss0 + loss1.
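To make the setup concrete, here is a minimal sketch of the two regimes (the tiny linear models and random data are just placeholders for my actual models):

import torch
import torch.nn as nn

torch.manual_seed(0)
model0 = nn.Linear(10, 2)
model1 = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
x = torch.randn(16, 10)
target = torch.randint(0, 2, (16,))

# regime 1: backward on each loss individually
loss0 = criterion(model0(x), target)
loss1 = criterion(model1(x), target)
loss0.backward()
loss1.backward()

model0.zero_grad()
model1.zero_grad()

# regime 2: backward once on the summed loss
loss = criterion(model0(x), target) + criterion(model1(x), target)
loss.backward()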
I'd appreciate any thoughts on this problem.
Cheers

I think that's how it is supposed to work, no?
For instance, softmax and triplet losses are combined this way too.

Could you please elaborate? Since grad(loss) = grad(loss0) + grad(loss1) and the two models have no parameters in common, each model should receive the same gradients as when I trained them individually.
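To spell the step out: loss0 depends only on model0's parameters \theta_0 and loss1 only on \theta_1, so the gradient of the sum splits cleanly:

\[
\frac{\partial(\mathrm{loss}_0 + \mathrm{loss}_1)}{\partial \theta_0} = \frac{\partial \mathrm{loss}_0}{\partial \theta_0},
\qquad
\frac{\partial(\mathrm{loss}_0 + \mathrm{loss}_1)}{\partial \theta_1} = \frac{\partial \mathrm{loss}_1}{\partial \theta_1}.
\]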

I again emphasize that the only difference is that I sum the losses and backprop on the summed loss.

I see. I missed that they are two independent models.

How do you initialize the optimizers?
Do you use the same optimizer for both?

The parameters of the two models are put into a single list, and one optimizer is set up from that list. So I assume the optimizer works the same way for both models.
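Roughly like this (a minimal sketch with placeholder models; the lr and momentum values are just examples):

import torch
import torch.nn as nn

model0 = nn.Linear(10, 2)
model1 = nn.Linear(10, 2)
# one optimizer driving the concatenated parameter lists of both models
params = list(model0.parameters()) + list(model1.parameters())
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)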

As you said here, the gradients are independent, I suppose.
So I am not sure how to debug this. I do not know whether advanced optimizers (like Adam) treat the parameters as independent entities or whether they perform some operation that normalizes the gradients across them.
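One check that comes to mind (a hypothetical sketch with two small linear models): take a single Adam step with two separate optimizers and with one joint optimizer, starting from identical seeds, and compare the resulting parameters:

import torch
import torch.nn as nn

def make_models():
    # fixed seed so both runs start from identical weights
    torch.manual_seed(0)
    return nn.Linear(4, 2), nn.Linear(4, 2)

torch.manual_seed(1)
x = torch.randn(8, 4)
target = torch.randint(0, 2, (8,))
criterion = nn.CrossEntropyLoss()

# run 1: separate optimizers, separate losses
m0, m1 = make_models()
opt0 = torch.optim.Adam(m0.parameters(), lr=1e-3)
opt1 = torch.optim.Adam(m1.parameters(), lr=1e-3)
criterion(m0(x), target).backward()
criterion(m1(x), target).backward()
opt0.step()
opt1.step()
ref = [p.detach().clone() for p in list(m0.parameters()) + list(m1.parameters())]

# run 2: one optimizer over both parameter lists, summed loss
m0, m1 = make_models()
opt = torch.optim.Adam(list(m0.parameters()) + list(m1.parameters()), lr=1e-3)
(criterion(m0(x), target) + criterion(m1(x), target)).backward()
opt.step()

for p_ref, p in zip(ref, list(m0.parameters()) + list(m1.parameters())):
    print((p_ref - p).abs().max())  # 0.0 everywhere if the updates match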

Did you try with simple SGD?

What I used was SGD with a decay rate and momentum, though the idea of normalization seems interesting to me too. But it is not clear when that would happen. I suspect it may also be related to the number of parameters, since in the summed case the optimizer sees twice as many parameters as in each individual training run.

I am not sure about the number-of-parameters point. Do you have a concrete reason behind this thought? Is it related to overfitting?

Since the gradients are independent, I would expect them to behave more or less the same in both cases.
I hope there are no implementation-related issues.

@ptrblck I was hoping to get some leads from the PyTorch community about this issue.

Does using np.float32 rather than np.int64 for the labels lead to a degradation in accuracy?

If the losses are independent and do not share any of the parameters used to compute them, you should get the same results.
Are you able to reproduce this issue deterministically by seeding the code?
If so, could you share a code snippet so that we could have a look?
Here is a small example, which yields exactly the same gradients:

import torch
import torch.nn as nn
import torchvision.models as models

# separate losses
torch.manual_seed(2809)

model1 = models.resnet18()
model2 = models.resnet34()

x = torch.randn(2, 3, 224, 224)
target = torch.tensor([0, 999])
criterion = nn.CrossEntropyLoss()

output1 = model1(x)
output2 = model2(x)
loss1 = criterion(output1, target)
loss2 = criterion(output2, target)

loss1.backward()
loss2.backward()
grad1_ref = {name: p.grad.clone() for name, p in model1.named_parameters()}
grad2_ref = {name: p.grad.clone() for name, p in model2.named_parameters()}

model1.zero_grad()
model2.zero_grad()

# loss sum
torch.manual_seed(2809)

model1 = models.resnet18()
model2 = models.resnet34()

x = torch.randn(2, 3, 224, 224)
target = torch.tensor([0, 999])
criterion = nn.CrossEntropyLoss()

output1 = model1(x)
output2 = model2(x)
loss1 = criterion(output1, target)
loss2 = criterion(output2, target)

loss = loss1 + loss2
loss.backward()
grad1 = {name: p.grad.clone() for name, p in model1.named_parameters()}
grad2 = {name: p.grad.clone() for name, p in model2.named_parameters()}

# compare each model's gradients under its own parameter names
for name in grad1_ref:
    print('model1 grad diff for {}: {}'.format(
        name, (grad1_ref[name] - grad1[name]).abs().max()))
for name in grad2_ref:
    print('model2 grad diff for {}: {}'.format(
        name, (grad2_ref[name] - grad2[name]).abs().max()))