torch.cuda.amp.GradScaler scale going below one

Hi!
For some reason, when I train WGAN-GP with mixed precision using the torch.cuda.amp package, something happens to the GradScaler for the critic: during training, the scaler’s scale decreases from its usual values down to very low ones (like 1e-7).
Strangely, it happens only for the critic and only for the WGAN-GP model. When I try LS-GAN or a plain GAN, everything is fine.
So the problem is somehow inside the gradient penalty (GP) part, but I just can’t locate it. (For some reason, the fp16 grads for the GP contain NaNs, but I have no idea why.)

So the question is: what causes such low scale values, and what could be a solution?
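
For reference, the penalty follows the standard WGAN-GP formulation; a rough sketch (not my exact script — critic / real / fake stand in for my tensors, and the call happens inside an autocast region):

import torch

# Standard WGAN-GP gradient penalty (simplified sketch). Under autocast the
# critic runs in fp16, so these grads are computed through fp16 ops and can
# overflow.
def gradient_penalty(critic, real, fake):
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_interp = critic(interp)
    grads = torch.autograd.grad(
        outputs=d_interp, inputs=interp,
        grad_outputs=torch.ones_like(d_interp),
        create_graph=True, retain_graph=True)[0]
    grads = grads.view(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()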

This is expected: if the loss scaling is too high, the loss scaler reduces its scale value whenever NaN/Inf gradients are encountered.

This could point towards a high loss and isn’t a problem by itself, as long as the loss scaler stops decreasing the value after a while and thereafter only does so sporadically.

However, the loss value itself should never become invalid (NaN/Inf). Is this the case for your model?
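
You can see the mechanism in isolation in this small sketch (a GPU is assumed; the inf gradient is injected by hand just to trigger the skip):

import torch
import torch.nn as nn

# Force a non-finite gradient to show how GradScaler reacts: the optimizer
# step is skipped and the scale is multiplied by backoff_factor (0.5 by default).
model = nn.Linear(1, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

out = model(torch.randn(1, 1, device='cuda'))
scaler.scale(out.mean()).backward()
model.weight.grad.fill_(float('inf'))  # simulate an overflowing gradient

print(scaler.get_scale())  # 65536.0 (initial scale)
scaler.step(optimizer)     # skipped, since an inf grad was found
scaler.update()            # scale is reduced
print(scaler.get_scale())  # 32768.0
optimizer.zero_grad()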

Hi! Thanks for answering!

That was my first idea, but, unfortunately, the loss values are moderate (~0.1–2), so there is no reason for the scaler to go that low in the first place. Also, it never stops decreasing.

The loss values are always valid. However, when I ran the script with torch.autograd.detect_anomaly, there were NaNs/Infs in the gradients every ~10th iteration, so that’s the reason for the scaler to go down.
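
For reference, a minimal sketch of how the anomaly check can be enabled (my actual training loop is obviously larger):

import torch
import torch.nn as nn

# With anomaly mode enabled, autograd raises an error as soon as a backward
# op produces a NaN gradient and points at the offending operation.
with torch.autograd.detect_anomaly():
    model = nn.Linear(1, 1)
    loss = model(torch.randn(1, 1)).mean()
    loss.backward()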

That might be expected and doesn’t necessarily point towards broken training.
A high loss could create large gradients, which might overflow, but even if that’s not the case, large parameter values can also create these large gradients, as seen here:

import torch
import torch.nn as nn

# high loss: a large loss value scales up the gradients of both layers
model = nn.Sequential(
    nn.Linear(1, 1, bias=False),
    nn.Linear(1, 1, bias=False))
data = torch.randn(1, 1)
out = model(data)
loss = out.mean() * 1000.
loss.backward()
print(loss.item())
print(model[0].weight.grad)
print(model[1].weight.grad)


# large parameter values: even with a small loss, a large downstream weight
# inflates the gradient of the upstream layer
model = nn.Sequential(
    nn.Linear(1, 1, bias=False),
    nn.Linear(1, 1, bias=False))
with torch.no_grad():
    model[1].weight *= 1000.
data = torch.randn(1, 1)
out = model(data)
loss = out.mean()
loss.backward()
print(loss.item())
print(model[0].weight.grad)
print(model[1].weight.grad)
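
If you want to narrow down which parameters actually receive the overflowing gradients, you could unscale once before the step and inspect the grads directly; a rough sketch of the per-iteration check (standalone toy model here, a GPU is assumed):

import torch
import torch.nn as nn

# Diagnostic sketch: unscale the grads once before the step and report which
# parameters ended up with non-finite values.
model = nn.Linear(1, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(1, 1, device='cuda')).mean()
scaler.scale(loss).backward()

scaler.unscale_(optimizer)  # safe to call once per iteration before step()
for name, p in model.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print(f'non-finite grad in {name}')
scaler.step(optimizer)  # still skips the update if any inf/NaN grad was found
scaler.update()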

Are you only concerned about the loss scaler value, or is your training also not converging?