If I use autocast with fp16, should I expect the gradients to be computed in fp16 as well?
I've noticed that when I explicitly call .half() on a model, the gradients are computed in fp16, but when I use autocast they are computed in fp32. Is this the expected behavior?
Example:
import torch
import torch.nn as nn
torch.set_default_device('cuda')
model = nn.Linear(8, 1)  # parameters are created in fp32 by default
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(12, 8, dtype=torch.float16)
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=True):
    output = model(x)
    loss = output.mean()
loss.backward()
print(model.weight.grad.dtype)  # gradient is computed in fp32 despite autocast
# > torch.float32
opt.zero_grad()
model.half()
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=True):
    output = model(x)
    loss = output.mean()
loss.backward()
print(model.weight.grad.dtype)  # after manually calling .half(), gradient is computed in fp16
# > torch.float16
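For reference, here is a slightly modified check (fp32 input this time, assuming a CUDA device is available) which seems to show that autocast only casts the forward activations, while the parameters, and therefore their gradients, stay in fp32:

import torch
import torch.nn as nn

torch.set_default_device('cuda')

model = nn.Linear(8, 1)
x = torch.randn(12, 8)  # fp32 input this time

with torch.autocast(device_type='cuda', dtype=torch.float16):
    output = model(x)     # the linear layer runs in fp16 under autocast
    loss = output.mean()

print(output.dtype)        # torch.float16 -> the forward activations are cast
print(model.weight.dtype)  # torch.float32 -> the parameters are untouched

loss.backward()
print(model.weight.grad.dtype)  # torch.float32 -> the grad matches the parameter dtype

So the gradient dtype appears to follow the parameter dtype rather than the autocast dtype.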