I am using an open source distributed PyTorch implementation of training AlexNet from scratch on ImageNet (https://github.com/richardkxu/distributed-pytorch).

This implementation works flawlessly as is, but as soon as I add an additional loss (`loss_contrastive`) in the following manner:

```
loss = criterion(output, target)
loss_contrastive = getContrastiveLoss(target, rep3, rep4, rep5, contrastive_idxs)
loss += 0.1*loss_contrastive
optimizer.zero_grad()
# Mixed-precision training requires that the loss is scaled in order
# to prevent the gradients from underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
```

I get a ZeroDivisionError on the last line. I am also getting gradient overflow warnings for many consecutive steps (`Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324`). Looking at the two losses separately, both start at around ~10, and then `loss_contrastive` begins rapidly increasing. After many steps with `loss_contrastive` at around ~10^8 and many gradient overflows (by which point the original loss is ~50), both losses become NaNs.

For context, `loss_contrastive` is simply a contrastive MSE loss that aims to minimize the distance between certain representations for certain inputs and maximize it for others. Am I adding the new loss incorrectly? Any ideas what might be causing this?
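For reference, `getContrastiveLoss` is conceptually along these lines (a simplified sketch, not my exact code; the function name, signature, and pair bookkeeping here are made up for illustration):

```python
import torch
import torch.nn.functional as F

def contrastive_mse_loss(reps, pos_pairs, neg_pairs, margin=1.0):
    """Pull positive pairs of representations together, push negative
    pairs apart up to a margin. `reps` is a (N, D) tensor; pairs are
    lists of (i, j) index tuples."""
    # Positive pairs: plain MSE between the two representations.
    pos_loss = sum(F.mse_loss(reps[i], reps[j]) for i, j in pos_pairs)
    # Negative pairs: hinge on the distance, so pairs farther apart
    # than `margin` contribute zero loss.
    neg_loss = sum(
        F.relu(margin - F.mse_loss(reps[i], reps[j])) for i, j in neg_pairs
    )
    n = len(pos_pairs) + len(neg_pairs)
    return (pos_loss + neg_loss) / max(n, 1)
```

In my actual code this is computed over `rep3`, `rep4`, `rep5` with pairs selected via `contrastive_idxs`, but the structure is the same.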

Thanks!