The gradient from loss.backward() is different from the gradient computed manually

To find out why NaN/Inf values appear during training, I computed the gradient manually. However, the gradient from loss.backward() is different from the gradient I compute by hand.

I want to understand what happens inside backward() when training on multiple GPU nodes, so I ran the following experiment.
Steps:

  1. Select a layer: the stem conv.
  2. Hooks: capture the layer's input, output, and weight with register_forward_hook, and call retain_grad() on the output to capture its gradient (see the sketch after this list).
  3. Compute the weight gradient manually on each node:

import torch

def compute_weight_grad(self, input_tensor, grad_output, layer_weight):
    # Stem conv hyper-parameters: 3x3 kernel, stride 2, padding 1.
    kernel_size = 3
    stride = 2
    padding = 1
    fold_params = dict(kernel_size=kernel_size, stride=stride, padding=padding)
    unfold = torch.nn.Unfold(**fold_params)
    # (N, C_in * k * k, L), where L is the number of output positions
    unfold_input = unfold(input_tensor)
    # (N, C_out, H_out, W_out) -> (N, C_out, L)
    grad_output = grad_output.flatten(2)
    # (N, C_out, C_in * k * k): per-sample contribution to the weight gradient
    grad_weight = grad_output @ unfold_input.transpose(1, 2)
    # Sum over the batch and reshape to the weight's (C_out, C_in, k, k) layout.
    grad_weight = grad_weight.sum(0)
    grad_weight = grad_weight.view_as(layer_weight)
    return grad_weight
  4. Scale and average the gradients. Because of AMP training the loss is scaled by 256, so I scale the gradients back here and then average them across the nodes.
  5. Question
    The manually computed weight gradient is different from the gradient taken from model.parameters().
    Define:
ratio = manually_grad / grad_from_parameter
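
For reference, here is a minimal, self-contained sketch of the whole experiment (the hook setup from step 2 through the comparison in step 5), assuming the compute_weight_grad helper above is in scope. The conv shape, the dummy loss, the fixed loss scale of 256, and the distributed handling are placeholders for illustration, not the actual training script:

import torch
import torch.distributed as dist

torch.manual_seed(0)

# Step 1: a stand-in for the stem conv (3x3 kernel, stride 2, padding 1, no bias).
layer = torch.nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1, bias=False)
captured = {}

def forward_hook(module, inputs, output):
    # Step 2: keep the layer input and weight, and retain the gradient of the output.
    captured["input"] = inputs[0].detach()
    captured["weight"] = module.weight.detach()
    output.retain_grad()
    captured["output"] = output

handle = layer.register_forward_hook(forward_hook)

x = torch.randn(4, 3, 32, 32)
loss = layer(x).pow(2).mean()              # stand-in for the real loss
(loss * 256).backward()                    # fixed loss scale of 256, as in the AMP setup

# Step 3: manual weight gradient from the captured tensors.
# compute_weight_grad is the helper above; None stands in for its unused `self`.
manual_grad = compute_weight_grad(None, captured["input"],
                                  captured["output"].grad, captured["weight"])

# Step 4: undo the loss scale, then average across nodes when running distributed.
manual_grad = manual_grad / 256.0
autograd_grad = layer.weight.grad / 256.0  # unscale the autograd gradient the same way
if dist.is_available() and dist.is_initialized():
    dist.all_reduce(manual_grad, op=dist.ReduceOp.SUM)
    manual_grad = manual_grad / dist.get_world_size()

# Step 5: element-wise ratio between the manual and the autograd gradient.
ratio = manual_grad / autograd_grad
print(ratio.min().item(), ratio.max().item())

handle.remove()

In the actual training the forward pass runs inside the full model (and under autocast), but the hook-and-compare flow is the same.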

With this setup, I find that within a single iteration the ratio is constant, but different iterations give different ratios.
I want to know what is happening here. Thanks!

Note: clip_grad_norm is also used during training.
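
The training loop itself is not shown here; for context, a typical arrangement of AMP loss scaling and gradient clipping looks roughly like the sketch below. The model, optimizer, data, and max_norm value are placeholders, and torch.cuda.amp.GradScaler is only an assumption about how the loss scaling is implemented (the post describes a fixed scale of 256):

import torch

# Placeholders so the snippet runs on its own; the real model/optimizer are not shown.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
use_cuda = torch.cuda.is_available()
if use_cuda:
    model = model.cuda()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

for _ in range(3):
    x = torch.randn(8, 16, device="cuda" if use_cuda else "cpu")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=use_cuda):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()          # .grad carries the loss scale after this
    scaler.unscale_(optimizer)             # divide .grad by the current scale
    # clip_grad_norm_ rescales every .grad in place by the same per-iteration
    # factor whenever the total norm exceeds max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()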