Hi, I tried looking in the documentation but I couldn't find what I'm looking for.
My question is: how are the intermediate gradients collected by backward hooks affected by amp?
I understand that during training we should use GradScaler on the loss, as in GradScaler.scale(loss).backward(), but what about the gradients collected by a backward hook? Should those also be scaled?
EDIT:
I tried doing some testing with a simple setup and found that when amp is NOT used I get different gradients than when amp IS used, e.g.
tensor([[-0.6644, 0.6644]], device='cuda:0') # normal
tensor([[-43552., 43552.]], device='cuda:0', dtype=torch.float16) # amp
but applying GradScaler.scale to the captured gradient did not give back the non-amp value. Is there a way to use amp together with hooks to gather intermediate gradients?
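Concretely, what I mean by "using GradScaler.scale on the gradient" is something along these lines (a sketch, not my exact code; captured and save_grad_hook are just illustrative names):

captured = []

def save_grad_hook(module, grad_input, grad_output):
    # Keep the gradient w.r.t. the module output for later inspection.
    captured.append(grad_output[0].detach())

# ... after scaler.scale(loss).backward() has run:
rescaled = scaler.scale(captured[0])  # still does not match the non-amp gradient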
Here are the main parts of my testing setup.
Model:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(10, 5, bias=True)
        self.r1 = nn.ReLU()
        self.fc2 = nn.Linear(5, 2, bias=True)

    def forward(self, x):
        x = self.fc1(x)
        x = self.r1(x)
        x = self.fc2(x)
        return x
Hook:
def backward_hook(module, grad_input, grad_output):
    # Print the gradient w.r.t. the module output seen during backward().
    print(grad_output[0])
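The hook needs to be registered on a layer before running backward; it is attached along these lines (fc2 is just an example of the layer being inspected, and I assume the newer register_full_backward_hook API here):

# Attach the backward hook to the layer whose output gradient I want to see.
hook_handle = model.fc2.register_full_backward_hook(backward_hook)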
Training:
dataset = TensorDataset(x, y)
dataloader = DataLoader(dataset, shuffle=False, batch_size=1, num_workers=0)

for data, target in dataloader:
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    if amp:
        with autocast():
            out = model(data)
            loss = loss_fn(out, target)
        scaled_loss = scaler.scale(loss)
        scaled_loss.backward()
        scaler.step(optimizer)
        scaler.update()
    else:
        out = model(data)
        loss = loss_fn(out, target)
        loss.backward()
        optimizer.step()
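For completeness, the remaining pieces referenced above (imports, model, optimizer, scaler, toy data) can be set up along these lines; the exact values are not important for the question and are only illustrative:

import torch
from torch import nn, optim
from torch.cuda.amp import autocast, GradScaler
from torch.utils.data import TensorDataset, DataLoader

device = torch.device('cuda')
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()

# Toy data matching fc1's input size (10 features) and the 2 output classes.
x = torch.randn(8, 10)
y = torch.randint(0, 2, (8,))

amp = True  # set to False for the non-amp comparison run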