The order of parameters during backward()

Hi, I am looking into the mechanics of the backward() function, and there is something I can't find the answer to.

From my understanding of the backward function, it starts at the last layer of the model to compute gradients and continues to the first layer, passing gradients along the way. How can I know the exact order in which parameters receive their gradients during the backward pass? Say I have layer1 connected to layer2, layer3, and layer4 in parallel, and layer2, layer3, and layer4 all connected to layer5. What should the order of backpropagation be?
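For concreteness, here is a toy sketch of the topology I mean (the module and the layer sizes are made up), with a hook registered on each parameter so I can watch the order in which gradients arrive:

import torch
import torch.nn as nn

# toy model: layer1 feeds layer2/3/4 in parallel, whose outputs feed layer5
class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 8)
        self.layer2 = nn.Linear(8, 8)
        self.layer3 = nn.Linear(8, 8)
        self.layer4 = nn.Linear(8, 8)
        self.layer5 = nn.Linear(24, 1)

    def forward(self, x):
        x = self.layer1(x)
        # layer2, layer3 and layer4 all consume the same activation "in parallel"
        a, b, c = self.layer2(x), self.layer3(x), self.layer4(x)
        return self.layer5(torch.cat([a, b, c], dim=1))

model = ToyNet()
for name, param in model.named_parameters():
    # print the parameter name as soon as its gradient is computed
    param.register_hook(lambda grad, name=name: print("grad ready for", name))

model(torch.randn(4, 8)).sum().backward()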

Also, I registered some post-backward hooks to print the order of parameters during the backward pass, but found out that some parameters never show up at all. I checked that these parameters have requires_grad=True. Is it possible that gradients are not propagated to a layer even with this setting?

I would expect to see the reverse of the execution order used in the forward pass.
E.g., if you connect layer1 to multiple layers, the forward pass still has a defined order of execution:

x = layer1(x)
x2 = layer2(x)
x3 = layer3(x)
...

However, I haven't experimented with it and did not verify whether Autograd is free to change the order as long as all dependencies are respected.

If parameters require gradients and were used to create the output or loss, they should accumulate the gradient in their .grad attribute. Could you describe this issue in more detail, please?
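As a small illustration of the flip side (a toy sketch, the names are mine): a parameter with requires_grad=True that never contributes to the loss keeps grad=None, and no hook fires for it:

import torch
import torch.nn as nn

used = nn.Linear(4, 4)
unused = nn.Linear(4, 4)   # requires_grad=True, but not part of the loss below

loss = used(torch.randn(2, 4)).sum()   # "unused" never enters this graph
loss.backward()

print(used.weight.grad is None)    # False: a gradient was accumulated
print(unused.weight.grad is None)  # True: no gradient, so no hook would fire either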

@ptrblck This could be a problem with the particular code that I am trying to run. I am running the cifar10 MoE example in DeepSpeedExamples.

As posted here, DeepSpeed-MoE's expert gating layer does not seem to take part in backpropagation. I tried to print out the name of the layer being back-propagated using a backward hook, but the name (or param_id) of this layer never shows up when running backward().

I still need to check whether this parameter's param.grad is non-zero, but if a gradient had been computed during backward(), I think the backward hook should have been called. I am waiting on a reply to my post in the DeepSpeed GitHub issues to find out whether this is expected behavior.
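In the meantime, a quick way to check (just a sketch using standard PyTorch calls; model stands in for the MoE model here) is to scan for parameters whose .grad is still None or all zeros after backward():

# after loss.backward() in the training loop
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if param.grad is None:
        print(name, "received no gradient at all")
    elif param.grad.abs().sum() == 0:
        print(name, "received an all-zero gradient")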

I don't know how you have registered the hooks, but maybe try registering a hook on each parameter so that it fires when the actual gradient is set, as seen here:

import torch
from torchvision import models

model = models.resnet18()

grads = {}
def get_hook(name):
    def hook(grad):
        print("setting grad for param {} with abs().sum() gradient {}".format(
            name, grad.abs().sum()))
        grads[name] = grad.abs().sum().clone()
    return hook

for name, param in model.named_parameters():
    param.register_hook(get_hook(name))
    
x = torch.randn(1, 3, 224, 224)
out = model(x)
out.mean().backward()
# setting grad for param fc.bias with abs().sum() gradient 1.0000001192092896
# setting grad for param fc.weight with abs().sum() gradient 422.3572998046875
# setting grad for param layer4.1.bn2.weight with abs().sum() gradient 0.09817761182785034
# ...
# setting grad for param bn1.weight with abs().sum() gradient 0.04948922619223595
# setting grad for param bn1.bias with abs().sum() gradient 0.018050558865070343
# setting grad for param conv1.weight with abs().sum() gradient 39.974876403808594

# verify
compare = {}
for name, param in model.named_parameters():
    print("param {} grad.abs().sum() {}".format(name, param.grad.abs().sum()))
    compare[name] = param.grad.abs().sum()
# param conv1.weight grad.abs().sum() 39.974876403808594
# param bn1.weight grad.abs().sum() 0.04948922619223595
# param bn1.bias grad.abs().sum() 0.018050558865070343
# ...
# param layer4.1.bn2.bias grad.abs().sum() 0.20669560134410858
# param fc.weight grad.abs().sum() 422.3572998046875
# param fc.bias grad.abs().sum() 1.0000001192092896

# compare using the dicts, as the two outputs above are in reversed order
for name in grads:
    reference = grads[name]
    current = compare[name]
    print("{} abs().max() error {}".format(name, (reference - current).abs().max()))
# fc.bias abs().max() error 0.0
# fc.weight abs().max() error 0.0
# layer4.1.bn2.weight abs().max() error 0.0
# ...
# bn1.weight abs().max() error 0.0
# bn1.bias abs().max() error 0.0
# conv1.weight abs().max() error 0.0

I am sorry for not providing enough context. I should try out your suggestion.

The function reduce_partition_and_remove_grads is registered as a hook on the gradient accumulation function with register_hook inside the DeepSpeed code.
If you are curious, you can take a look at this code inside stage_1_and_2.py. In the existing reduce_partition_and_remove_grads code, I added a minimal print of the ID of the parameter whose gradient is being accumulated.

def reduce_partition_and_remove_grads(*notneeded):
    # `self`, `param` and `i` are captured from the enclosing scope in the DeepSpeed code
    print("Running hook for {}".format(self.get_param_id(param)))    # Added code
    self.reduce_ready_partitions_and_remove_grads(param, i)
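
For reference, this is roughly what that registration pattern looks like outside of DeepSpeed, as a standalone sketch: the expand_as trick to reach the parameter's gradient-accumulation node mirrors what the DeepSpeed code does, but the toy model and hook below are my own:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))

grad_accs = []  # keep references so the accumulation nodes are not garbage collected
for name, param in model.named_parameters():
    # expand_as gives a non-leaf view whose grad_fn points back to the
    # parameter's AccumulateGrad node via next_functions
    grad_acc = param.expand_as(param).grad_fn.next_functions[0][0]

    def make_hook(name):
        def hook(*unused):
            print("gradient accumulated for", name)
        return hook

    grad_acc.register_hook(make_hook(name))
    grad_accs.append(grad_acc)

model(torch.randn(2, 4)).sum().backward()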

Thank you @ptrblck for your comments! 🙂