If I use autocast with fp16, should I expect the gradients to be computed in fp16 as well?
I've noticed that when I explicitly call .half() on a model, the gradients are computed in fp16, but when I use autocast they are computed in fp32. Is this the expected behavior?
Example:
import torch
import torch.nn as nn
torch.set_default_device('cuda')
model = nn.Linear(8, 1)  # parameters are created in fp32 by default
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(12, 8, dtype=torch.float16)
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=True):
    output = model(x)
    loss = output.mean()
loss.backward()
print(model.weight.grad.dtype)  # gradient is computed in fp32 despite autocast
# > torch.float32
opt.zero_grad()
model.half()
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=True):
    output = model(x)
    loss = output.mean()
loss.backward()
print(model.weight.grad.dtype)  # after manually calling .half(), gradient is computed in fp16
# > torch.float16
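For reference, here is a slightly modified check (fp32 input this time, assuming a CUDA device is available) which seems to show that autocast only casts the forward activations, while the parameters, and therefore their gradients, stay in fp32:

import torch
import torch.nn as nn

torch.set_default_device('cuda')

model = nn.Linear(8, 1)
x = torch.randn(12, 8)  # fp32 input this time

with torch.autocast(device_type='cuda', dtype=torch.float16):
    output = model(x)     # the linear layer runs in fp16 under autocast
    loss = output.mean()

print(output.dtype)        # torch.float16 -> the forward activations are cast
print(model.weight.dtype)  # torch.float32 -> the parameters are untouched

loss.backward()
print(model.weight.grad.dtype)  # torch.float32 -> the grad matches the parameter dtype

So the gradient dtype appears to follow the parameter dtype rather than the autocast dtype.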