Changing the reduction parameter in BCEWithLogitsLoss for a multi-label problem changes the results a lot

Hi,
I trained a deep neural network for multi-label classification. When I used the ‘mean’ reduction in BCEWithLogitsLoss, the results were very low, and when I changed it to ‘sum’, the results got much better. What does this say about my model? Is there a problem with the gradient values in the first setup?
Thanks

Have you played around with some hyperparameters, e.g. have you tried to increase the learning rate? The gradients should be larger when using the 'sum' reduction:

import torch
import torch.nn as nn
import torchvision.models as models

# mean
torch.manual_seed(2809)
model = models.resnet18()
target = torch.randint(0, 2, (1, 1000)).float()
criterion = nn.BCEWithLogitsLoss(reduction='mean')
output = model(torch.randn(1, 3, 224, 224))
loss = criterion(output, target)
loss.backward()
print(model.conv1.weight.grad.sum())
> tensor(-0.1137)

# sum
torch.manual_seed(2809)
model = models.resnet18()
target = torch.randint(0, 2, (1, 1000)).float()
criterion = nn.BCEWithLogitsLoss(reduction='sum')
output = model(torch.randn(1, 3, 224, 224))
loss = criterion(output, target)
loss.backward()
print(model.conv1.weight.grad.sum())
> tensor(-113.7127)
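
The factor of ~1000 between the two gradient sums just reflects the number of loss elements (1 sample x 1000 outputs), since the 'mean' loss is the 'sum' loss divided by the element count. A quick sketch to verify this relationship (independent of the resnet example above):

import torch
import torch.nn as nn

logits = torch.randn(1, 1000)
target = torch.randint(0, 2, (1, 1000)).float()

loss_mean = nn.BCEWithLogitsLoss(reduction='mean')(logits, target)
loss_sum = nn.BCEWithLogitsLoss(reduction='sum')(logits, target)

# 'sum' equals 'mean' scaled by the number of elements (up to floating point error)
print(torch.allclose(loss_mean * target.numel(), loss_sum))
> True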

Yes, I changed the learning rate from 0.01 to 0.5 and the results with ‘mean’ don’t get better at all. I have absolutely no idea why this happens.

What are the loss values you get for mean and sum reductions?
@ptrblck Does it make sense to use a higher learning rate in the case of the mean reduction, since the loss and gradients could be small? Or to use a constant multiplier on the loss to magnify it?

I would use the 'mean' reduction as the default value and try out different learning rates.
Otherwise your training (and learning rate) will depend on the batch size.

However, if a summed loss gives better convergence than the mean, a possible explanation would be the increased scale of the gradients.
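
To illustrate the batch size dependence, here is a small sketch (random data, 39 labels as in your use case): with 'sum' the loss grows roughly linearly with the batch size, while with 'mean' it stays on the same scale, so a learning rate tuned for one batch size wouldn't transfer to another:

import torch
import torch.nn as nn

criterion_mean = nn.BCEWithLogitsLoss(reduction='mean')
criterion_sum = nn.BCEWithLogitsLoss(reduction='sum')

for batch_size in [1, 8, 64]:
    logits = torch.randn(batch_size, 39)
    target = torch.randint(0, 2, (batch_size, 39)).float()
    # 'mean' stays on roughly the same scale, 'sum' grows with the batch size
    print(batch_size,
          criterion_mean(logits, target).item(),
          criterion_sum(logits, target).item())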


Thanks a lot for your help.

When I use mean, the training loss starts at ‘1.3166’, and when I use sum, it starts at ‘264594.4062’ (the dataset has 39 labels), which doesn’t make sense to me.

It could be that the learning rate is too low for the mean reduction. Try a higher learning rate to see if it converges.

I tried increasing and decreasing the learning rate and nothing has changed.

Does your loss remain constant (or undergo very minimal change) when the mean reduction is used?

When I set the lr to 0.001, the training loss decreases very slowly (approximately 0.0004 in every iteration) but the test loss stays the same. When I set it to 0.1, the training loss decreases less slowly (about 0.002 in each iteration), and the test loss fluctuates up and down.

Try scaling up your loss by some constant. Also try an even higher learning rate, such that your training loss doesn’t oscillate too much across iterations.
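
If you want to try the constant scaling, a minimal sketch could look like this (the toy model, sizes, and the factor 100 are made-up placeholders; note that for plain SGD this is equivalent to scaling the learning rate by the same factor):

import torch
import torch.nn as nn

# toy setup just for illustration
model = nn.Linear(10, 39)
criterion = nn.BCEWithLogitsLoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scale = 100.0  # arbitrary constant multiplier, tune together with the learning rate

data = torch.randn(8, 10)
target = torch.randint(0, 2, (8, 39)).float()

optimizer.zero_grad()
loss = criterion(model(data), target) * scale  # magnifies the loss and hence the gradients
loss.backward()
optimizer.step()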