Hi all. Last time I asked why my MSE loss was not converging with the Adam optimizer and a ResNet50 architecture. I think I may have found the problem, but I'm not sure.

For now I'm simply feeding the prediction of my ResNet50 and the target value to the MSE loss, i.e. loss = criterion(output, target.float()). In my research, instead of using the ResNet50 to make the prediction directly, I want to predict a "residual", which means loss = criterion(output, target.float() - something). However, when I do this, the loss does not converge; it just fluctuates around its starting value. Can someone offer a possible explanation?

Subtracting a constant from the target changes the loss range and thus the gradient magnitudes, and may require adapting your hyperparameters (e.g. a lower learning rate to avoid exploding gradients).
Also, depending on the value of something, your model might need more time to learn the bias needed to counter this offset.
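To make the first point concrete, here is a quick sketch (my own addition, not from the original example) comparing the bias gradient of a tiny model with and without a large constant subtracted from the target:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(1, 1)
data = torch.randn(4, 1)
target = torch.randn(4, 1)
criterion = nn.MSELoss()

# gradient magnitude with the original target
criterion(model(data), target).backward()
grad_plain = model.bias.grad.abs().item()

model.zero_grad()

# gradient magnitude after shifting the target by a large constant
criterion(model(data), target - 10000.).backward()
grad_shifted = model.bias.grad.abs().item()

# the shifted target yields gradients several orders of magnitude larger
print(grad_plain, grad_shifted)
```

Since dL/db for MSELoss is 2 * mean(output - target), shifting the target by 10000 pushes the bias gradient to roughly 20000, which is why the same learning rate can behave very differently.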
Here is a very simple example using a single linear layer, which explains my claim a bit better:

import torch
import torch.nn as nn

# standard use case
torch.manual_seed(2809)

# setup
model = nn.Linear(1, 1)
data = torch.randn(1, 1)
target = torch.randn(1, 1)
print('data {}, target {}'.format(data, target))
print('model parameters {}'.format(list(model.parameters())))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

nb_epochs = 2000
for epoch in range(nb_epochs):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))

print('output {}, target {}'.format(output, target))
print('model parameters {}'.format(list(model.parameters())))
# biased use case
torch.manual_seed(2809)

# setup
model = nn.Linear(1, 1)
data = torch.randn(1, 1)
target = torch.randn(1, 1)
print('data {}, target {}'.format(data, target))
print('model parameters {}'.format(list(model.parameters())))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

nb_epochs = 2000
for epoch in range(nb_epochs):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target - 10000.)
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))

print('output {}, target {}'.format(output, target))
print('model parameters {}'.format(list(model.parameters())))

While the "standard" use case converges fine, the biased approach starts with a very high loss and isn't able to converge within the 2000 epochs.
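One workaround (my own suggestion, not part of the example above) is to initialize the model's bias near the known offset, so the optimizer doesn't have to crawl there in tiny Adam steps and only needs to learn the residual on its original scale:

```python
import torch
import torch.nn as nn

torch.manual_seed(2809)

model = nn.Linear(1, 1)
# start the bias at the known offset instead of making Adam learn it
with torch.no_grad():
    model.bias.fill_(-10000.)

data = torch.randn(1, 1)
target = torch.randn(1, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

losses = []
for epoch in range(2000):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target - 10000.)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# the loss now starts at an ordinary scale and decreases during training
print(losses[0], losses[-1])
```

With the bias pre-set, the initial error is back on the order of the un-shifted case, so the same learning rate and epoch budget suffice.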

I see how the loss behavior can change depending on the target range. However, I now realize that my loss fails to converge because of some unusual behavior in my network: it always learns the mean of my targets but never the variance. Please see my new post here: CNN only learning the mean of targets. Thank you so much for helping!