I am using an open source distributed PyTorch implementation of training AlexNet from scratch on ImageNet (https://github.com/richardkxu/distributed-pytorch).
This implementation works flawlessly as is. As soon as I add an additional loss (loss_contrastive) in the following manner:
```python
loss = criterion(output, target)
loss_contrastive = getContrastiveLoss(target, rep3, rep4, rep5, contrastive_idxs)
loss += 0.1 * loss_contrastive

optimizer.zero_grad()
# Mixed-precision training requires that the loss is scaled in order
# to prevent the gradients from underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
```
I get a ZeroDivisionError on the last line (scaled_loss.backward()). I also get gradient-overflow warnings for many consecutive steps:

Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324

Logging the two losses separately, both start at around ~10, but loss_contrastive then increases rapidly. After many steps with loss_contrastive at around ~10^8 and repeated gradient overflows (by which point the original loss is ~50), both losses become NaN.
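For context on why the error shows up exactly there: apex's dynamic loss scaler halves the loss scale after every overflowing step, and 5e-324 is the smallest positive subnormal float64, so one more halving underflows the scale to exactly 0.0; dividing by the scale when unscaling then raises ZeroDivisionError. A minimal sketch of that arithmetic (the halving behavior is apex's; the rest is plain Python floats):

```python
# 5e-324 is the smallest positive subnormal double; halving it underflows to 0.0
scale = 5e-324
scale = scale / 2          # dynamic loss scaling halves the scale on each overflow
print(scale)               # 0.0
# unscaling gradients involves dividing by the scale, which now fails:
try:
    inv_scale = 1.0 / scale
except ZeroDivisionError as e:
    print("unscale fails:", e)
```

So the ZeroDivisionError is a downstream symptom: the real problem is that the gradients keep overflowing until the scaler drives the scale to zero.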
loss_contrastive is simply a contrastive MSE loss that aims to minimize the distance between representations for certain pairs of inputs and maximize it for others. Am I handling the addition of a new loss incorrectly? Any ideas what might be causing this?
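For reference, the loss I mean is roughly of this shape (a hypothetical sketch, not my actual getContrastiveLoss; the function name, arguments, and the margin are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_mse_sketch(target, rep, pair_idxs, margin=1.0):
    # Hypothetical sketch of a contrastive MSE loss:
    # pull same-class representation pairs together (minimize MSE distance),
    # push different-class pairs apart up to a margin (hinge on the distance).
    loss = torch.tensor(0.0)
    for i, j in pair_idxs:
        d = F.mse_loss(rep[i], rep[j])
        if target[i] == target[j]:
            loss = loss + d                    # positives: minimize distance
        else:
            loss = loss + F.relu(margin - d)   # negatives: push apart to the margin
    return loss / max(len(pair_idxs), 1)
```

Note that the negative term is unbounded only through the margin here; if the distances themselves are unnormalized, the positive term can still grow without bound, which is the kind of blow-up I am seeing.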