I’m not sure why the shapes differ, but apparently the wrong gradients are stored.
Here is a small dummy example using vgg16:
```python
import torch
from torchvision import models

grads = []
def save_grad(grad):
    grads.append(grad)

# Create model
model = models.vgg16()
model.eval()

# First approach: hook the output of the whole features block
x = torch.randn(1, 3, 224, 224)
output = model.features(x)
output.register_hook(save_grad)
output = model.avgpool(output)
output = output.view(output.size(0), -1)
output = model.classifier(output)
output.mean().backward()

# Reset
model.zero_grad()

# Second approach: iterate the submodules and hook the last one ('30')
output = x.clone()
for name, module in model.features._modules.items():
    output = module(output)
    if name == '30':
        output.register_hook(save_grad)
output = model.avgpool(output)
output = output.view(output.size(0), -1)
output = model.classifier(output)
output.mean().backward()

# Compare gradients
print((grads[0] == grads[1]).all())
> tensor(1, dtype=torch.uint8)
```
I tried to stick to your approach and as you can see, both methods yield the same gradients.
I guess `nn.DataParallel` isn’t being used here, since you are not calling the model directly but each submodule.
This might be related to this issue.