Adam Optimizer + fp16 autocast

To avoid a Null loss with the Adam optimizer under fp16 autocast, I have to change the eps value from 1e-8 to 1e-6. However, I found that by doing this, my model converges much more slowly, or does not converge at all. Does anyone know why this happens?

Could you explain what issues you are seeing in the loss and what Null means in this case?

The issue is that when eps is set to 1e-8 (the default) and used with autocast, the network’s loss inevitably becomes Null after some number of epochs. Increasing eps to a higher value seems to make the problem go away, yet compared to training not done in fp16, convergence is much slower.

Are you seeing NaN loss values after a time or what is Null referring to?
If so, what kind of model architecture are you using?

Sorry, I meant NaN loss. I’m performing NAS (neural architecture search) for pose estimation.

I’m not sure if your NAS implementation explicitly initializes the model parameters with “large” values, or if the general training of these new architectures tends to create large output values, which could easily cause overflows. In that case I’m unsure if there is a better workaround than increasing the eps value, as 1e-8 might underflow in FP16 if you add it to larger tensors.
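As a quick illustration of that underflow concern (this is just FP16 arithmetic in isolation, not the Adam internals, which are discussed below):

import torch

# FP16 has roughly 3 decimal digits of precision near 1.0 (machine eps ~1e-3),
# so adding 1e-8 to a value of order 1 changes nothing
x = torch.tensor(1.0, dtype=torch.float16)
print(x + 1e-8)                                 # tensor(1., dtype=torch.float16)

# 1e-8 is also below the smallest FP16 subnormal (~6e-8), so on its own it flushes to zero
print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)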

Since eps sits in the denominator that decides the step size, a larger eps would mean a smaller step. Would you suggest that I increase the overall learning rate to compensate?
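My mental model of where eps enters the update, as a rough sketch with made-up moment values (not the actual PyTorch source):

import torch

lr = 1e-3
m_hat = torch.tensor(1e-4)    # bias-corrected first moment (made-up value)
v_hat = torch.tensor(1e-12)   # bias-corrected second moment (made-up value)

for eps in (1e-8, 1e-6):
    # standard Adam step: lr * m_hat / (sqrt(v_hat) + eps)
    step = lr * m_hat / (v_hat.sqrt() + eps)
    print(f"eps={eps:.0e} -> step={step.item():.3e}")

# the larger eps only shrinks the step noticeably when sqrt(v_hat) is comparable to eps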

The eps value is used here to avoid dividing by zero.
I just checked the state_dict of Adam used with autocast, and it seems all internal buffers are still stored in FP32, so I’m unsure why the eps value would cause trouble (due to a potential underflow) in this case.

Here is a minimal code snippet:

import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# forward pass under autocast, so the output is computed in FP16
with torch.cuda.amp.autocast():
    out = model(torch.randn(1, 10).cuda())

loss = out.mean()
loss.backward()

optimizer.step()
# the exp_avg / exp_avg_sq buffers in the printed state are still FP32
print(optimizer.state_dict())

@mcarilli are you familiar with similar issues using Adam?

I use GradScaler, could this be the problem?
Can you use autocast independently of GradScaler?
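For reference, the usual autocast + GradScaler recipe looks roughly like this (a simplified sketch with a placeholder model and random data, not the actual training code from this thread):

import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    optimizer.zero_grad()
    # the forward pass runs in mixed precision under autocast
    with torch.cuda.amp.autocast():
        out = model(torch.randn(8, 10).cuda())
        loss = out.mean()
    # GradScaler scales the loss to avoid FP16 gradient underflow,
    # unscales the grads before the optimizer step, and skips the step if it finds inf/NaN
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()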