To track down why NaN/Inf values appear during training, I computed the gradient manually. However, the manually computed gradient differs from the gradient produced by loss.backward(). I want to know what happens inside backward() when training on multiple GPU nodes.
I did the following experiment.
Steps:
- Selected layer: the stem conv
- Hooks: captured the layer's input, output, and weight with register_forward_hook, and called retain_grad() on the output so its gradient is available after backward() (see the sketch after this list)
- Computed the weight gradient manually on each node:
def compute_weight_grad(self, input_tensor, grad_output, layer_weight):
    # Stem conv hyperparameters: 3x3 kernel, stride 2, padding 1
    kernel_size = 3
    stride = 2
    padding = 1
    fold_params = dict(kernel_size=kernel_size, stride=stride, padding=padding)
    unfold = torch.nn.Unfold(**fold_params)
    # (N, C_in * k * k, L), where L is the number of sliding positions
    unfold_input = unfold(input_tensor)
    # (N, C_out, H_out, W_out) -> (N, C_out, L)
    grad_output = grad_output.flatten(2)
    # (N, C_out, C_in * k * k): per-sample weight gradient
    grad_weight = grad_output @ unfold_input.transpose(1, 2)
    # Sum over the batch and reshape to the weight's shape
    grad_weight = grad_weight.sum(0)
    grad_weight = grad_weight.view_as(layer_weight)
    return grad_weight
- Scaled and averaged the gradients. Because of AMP training the loss is scaled by 256, so I scale the manual gradient back (divide by 256) and then average the gradients across the nodes.
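For reference, here is a minimal, self-contained sketch of how I wire the hooks and the scale-back step. The tiny Conv2d, the random input, and the toy loss are placeholders just for illustration; in the real run the layer is the network's stem conv, and the unscaled gradient is additionally averaged across nodes with torch.distributed.all_reduce.

```python
import torch
import torch.nn as nn

# Toy stand-in for the real stem conv (3x3, stride 2, padding 1), just for illustration.
stem_conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False)
saved = {}

def forward_hook(module, inputs, output):
    # Keep the tensors the manual computation needs.
    saved["input"] = inputs[0].detach()
    saved["weight"] = module.weight.detach()
    output.retain_grad()               # so output.grad is populated by backward()
    saved["output"] = output

stem_conv.register_forward_hook(forward_hook)

x = torch.randn(2, 3, 32, 32)
out = stem_conv(x)
loss = out.pow(2).mean()
(loss * 256).backward()                # mimic the fixed AMP loss scale of 256

grad_output = saved["output"].grad     # gradient of the *scaled* loss w.r.t. the output
# compute_weight_grad is the method shown above; its `self` argument is unused, so pass None.
manual_grad = compute_weight_grad(None, saved["input"], grad_output, saved["weight"])

# Scale back (divide by 256); in the real multi-node run I then average across nodes,
# e.g. dist.all_reduce(manual_grad) followed by division by the world size.
manual_grad = manual_grad / 256.0
```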
- Questions
The manually computed weight gradient is different from the gradient taken from model.parameters().
Define:
ratio = manual_grad / grad_from_parameter
Within a single iteration the ratio is constant (the same value for every element of the weight gradient), but different iterations give different ratios.
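Concretely, the check I run after each iteration looks roughly like this (tensor names follow the sketch above and are placeholders):

```python
# Element-wise ratio between the two gradients (names as in the sketch above).
ratio = manual_grad / stem_conv.weight.grad
print(ratio.min().item(), ratio.max().item())  # min == max within one iteration,
                                               # but the value changes from iteration to iteration
```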
I want to know what is happening here. Thanks!