I am using an open source distributed PyTorch implementation of training AlexNet from scratch on ImageNet (https://github.com/richardkxu/distributed-pytorch).

This implementation works flawlessly as is, but as soon as I add an additional loss (`loss_contrastive`) in the following manner:

```
loss = criterion(output, target)
loss_contrastive = getContrastiveLoss(target, rep3, rep4, rep5, contrastive_idxs)
loss += 0.1*loss_contrastive
optimizer.zero_grad()
# Mixed-precision training requires that the loss is scaled in order
# to prevent the gradients from underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
```

I get a ZeroDivisionError on the last line. I am also getting gradient overflow warnings for many consecutive steps (`Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324`). Looking at the two losses separately, both start at around ~10, and then `loss_contrastive` begins rapidly increasing. After many steps with `loss_contrastive` at around ~10^8 and many gradient overflows (by which point the original loss is ~50), both losses become NaNs.

For context, `loss_contrastive` is simply a contrastive MSE loss that aims to minimize the distance between certain representations for certain inputs and maximize it for others. Am I adding the new loss incorrectly? Any ideas what might be causing this?
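For reference, `getContrastiveLoss` is conceptually along these lines (a simplified sketch, not my exact code; the function name, signature, and pair bookkeeping here are made up for illustration):

```python
import torch
import torch.nn.functional as F

def contrastive_mse_loss(reps, pos_pairs, neg_pairs, margin=1.0):
    """Pull positive pairs of representations together, push negative
    pairs apart up to a margin. `reps` is a (N, D) tensor; pairs are
    lists of (i, j) index tuples."""
    # Positive pairs: plain MSE between the two representations.
    pos_loss = sum(F.mse_loss(reps[i], reps[j]) for i, j in pos_pairs)
    # Negative pairs: hinge on the distance, so pairs farther apart
    # than `margin` contribute zero loss.
    neg_loss = sum(
        F.relu(margin - F.mse_loss(reps[i], reps[j])) for i, j in neg_pairs
    )
    n = len(pos_pairs) + len(neg_pairs)
    return (pos_loss + neg_loss) / max(n, 1)
```

In my actual code this is computed over `rep3`, `rep4`, `rep5` with pairs selected via `contrastive_idxs`, but the structure is the same.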

Thanks!