Forward and backward in PyTorch

Hi, I want to ask about the difference between the following two pieces of code:

class ModelOutputs():
    """ Class for making a forward pass and getting:
    1. The network output.
    2. Activations from intermediate targeted layers.
    3. Gradients from intermediate targeted layers. """

    def __init__(self, model, target_layers):
        self.model = model
        self.target_layers = target_layers
        self.gradients = None

    def save_gradient(self, grad):
        self.gradients = grad

    def __call__(self, x):
        conv_outputs = None
        for name, module in self.model.features._modules.items():
            x = module(x)
            if name in self.target_layers:
                x.register_hook(self.save_gradient)
                conv_outputs = x
        output = x
        output = output.view(output.size(0), -1)
        output = self.model.classifier(output)
        return output, conv_outputs

I pass a VGG model as the model argument of ModelOutputs()
and

class VGG(nn.Module):
    def __init__(self, vgg_name):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        self.classifier = nn.Linear(512, 10)
        self.gradients = None

    def save_gradient(self, grad):
        self.gradients = grad

    def forward(self, x):
        x = self.features(x)
        x.register_hook(self.save_gradient)
        inter = x
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x, inter

I’m just curious what the difference between the two pieces of code above is for the forward and backward passes. When the loss is backpropagated, will the gradients be different?
Could someone help me?
Thanks in advance for your help!

You could use some dummy data and target, calculate the loss and gradients, and compare the gradients of (some) layers using something like this:

print(torch.allclose(modelA.model.features[0].weight.grad,
                     modelB.features[0].weight.grad))

I would recommend using the second approach, since the first one is not defined as an nn.Module, which will yield errors, e.g. if you try to call model.parameters() etc.
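For example, a rough sketch of how the wrapper could be turned into an nn.Module (the class name ModelOutputsModule is just a placeholder, not your code):

import torch
import torch.nn as nn

class ModelOutputsModule(nn.Module):
    """Hypothetical nn.Module version of the first wrapper: returns the network
    output and the targeted activation, and stores that activation's gradient."""
    def __init__(self, model, target_layers):
        super().__init__()
        self.model = model                  # registered as a submodule
        self.target_layers = target_layers
        self.gradients = None

    def save_gradient(self, grad):
        self.gradients = grad

    def forward(self, x):
        conv_outputs = None
        for name, module in self.model.features._modules.items():
            x = module(x)
            if name in self.target_layers:
                x.register_hook(self.save_gradient)
                conv_outputs = x
        x = x.view(x.size(0), -1)
        x = self.model.classifier(x)
        return x, conv_outputs

Since this version is a proper nn.Module, m_o = ModelOutputsModule(net, ['40']) supports m_o.parameters() and can be called like any other module.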


Thanks for your reply!

m_o = ModelOutputs(net, layer=40)
output, conv_output = m_o(input)
loss = cross_entropy(output, label)
loss.backward()
grad_m_o = m_o.gradients
grad_net = net.features[40].weight.grad

I find that grad_m_o is different from grad_net. Why is that?

And I have another problem: I want to wrap the model in DataParallel and use multiple GPUs on a single server, like this:

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
... ...
net.to(device)
net = torch.nn.DataParallel(net)
input = input.to(device)
...

And I use the first approach in the code above.

class ModelOutputs():

m_o = ModelOutputs(net, layer)
output, conv_output = m_o(input)

I passed net to ModelOutputs. When I ran the code, it didn’t use multiple GPUs, only GPU 0.
Is the problem caused by ModelOutputs not being defined as an nn.Module?

How large is the difference?
I’m not sure if the code you’ve posted is still the one currently in use, but it looks like you should pass target_layers as a list of strings.

I think you are not using multiple GPUs, since you never call the model itself (and thus its forward method), but instead split the call into submodules and call them in a loop.
Could you try to call net(input) and check the GPU utilization?
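A minimal sketch of that check, using torchvision's vgg16 as a stand-in for your net and assuming several GPUs are visible:

import torch
import torch.nn as nn
from torchvision import models

device = 'cuda'
net = models.vgg16().to(device)
net = nn.DataParallel(net)            # replicates the module across the visible GPUs

x = torch.randn(8, 3, 224, 224, device=device)
out = net(x)                          # calling the wrapped model directly scatters the batch
out.mean().backward()

# if more than one GPU was used, several devices should show allocated memory
for i in range(torch.cuda.device_count()):
    print(f'cuda:{i}: {torch.cuda.memory_allocated(i) / 1024**2:.1f} MB allocated')

You could also simply watch nvidia-smi while the forward pass runs.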

Sorry for the late reply. Yeah, I actually passed a list with a string, like [‘40’].

The shapes of m_o.gradients and net.features[40].weight.grad are different.

Yes, you’re right, I didn’t call net(input) directly. How can I use multiple GPUs with ModelOutputs? By inheriting from nn.Module and defining a forward function?

I’m not sure why the shapes differ, but apparently the wrong gradients are stored.
Here is a small dummy example using vgg16:

import torch
from torchvision import models

grads = []
def save_grad(grad):
    grads.append(grad)

# Create model
model = models.vgg16()
model.eval()

# First approach
x = torch.randn(1, 3, 224, 224)
output = model.features(x)
output.register_hook(lambda x: save_grad(x))
output = model.avgpool(output)
output = output.view(output.size(0), -1)
output = model.classifier(output)
output.mean().backward()

# Reset
model.zero_grad()

# Second approach
output = x.clone()
for name, module in model.features._modules.items():
    output = module(output)
    if '30' in name:
        output.register_hook(lambda x: save_grad(x))

output = model.avgpool(output)
output = output.view(output.size(0), -1)
output = model.classifier(output)
output.mean().backward()

# Compare gradients
print((grads[0] == grads[1]).all())
> tensor(1, dtype=torch.uint8)

I tried to stick to your approach and as you can see, both methods yield the same gradients.

I guess data parallel isn’t being used, since you are not calling the model directly, but each submodule.
This might be related to this issue.


Thanks for your quick reply!

Yes, with register_hook both methods yield the same gradients. Before this, I had tried:

I think the gradients of ModelOutputs(net, layer=40) are different from net.features[40].weight.grad.

That might be the case, and I suspect you might be registering the weight parameter and overwriting it with the bias parameter later, since both will match the condition.
Could you check the shape and see if you are in fact storing the bias gradient?
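A self-contained sketch of that kind of shape check, using a single dummy conv layer instead of your model:

import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
stored = {}

x = torch.randn(1, 16, 8, 8)
out = conv(x)
out.register_hook(lambda g: stored.update(grad=g))   # stores whatever gradient the hook receives
out.mean().backward()

print('stored grad:', tuple(stored['grad'].shape))   # here: (1, 32, 8, 8)
print('weight     :', tuple(conv.weight.shape))      # (32, 16, 3, 3)
print('bias       :', tuple(conv.bias.shape))        # (32,)

Comparing these shapes directly tells you which tensor the hook actually stored.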

Hi, @ptrblck. I have checked the shapes. The ConvNd weight is defined as:

self.weight = Parameter(torch.Tensor(
    out_channels, in_channels // groups, *kernel_size))

So the shape of features[40].weight.grad is the same as the weight’s shape:

(C_out, C_in // groups, *kernel_size)

whereas the gradient stored by ModelOutputs(net, layer=40) has shape (N, C_out, H_out, W_out).
I think that explains why the shapes are different. Am I right?

Maybe I misunderstood the issue, but I thought you had a shape mismatch comparing the two gradient tensors?
Are the shapes equal?

Sorry, maybe I wasn’t clear enough. I didn’t mean to mislead you. :grimacing: And thanks for your patience!

(quoting your vgg16 example from above)

The code you provided above yields the same gradients, i.e. grads[0] and grads[1] are the same. But both grads[0] and grads[1] differ in shape from model.features[30].weight.grad, because of how the ConvNd weight is defined (as quoted above).

So the shape of model.features[30].weight.grad and the shape of grads[0]/grads[1] are different, I think. :grinning:

I might still misunderstand the use case, but in your code example you are storing the gradients of the output of self.features:

    def forward(self, x):
        x = self.features(x)
        x.register_hook(self.save_gradient)

while you are now comparing these gradients to weight.grad.

Also, which model are you using?
I assumed it’s vgg16 and if that’s the case, features[30] is an nn.MaxPool2d layer, without any weight and bias parameters.
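To make the distinction concrete, here is a small sketch using torchvision's vgg16 (the index 28 for the last conv layer is my assumption for this architecture):

import torch
from torchvision import models

model = models.vgg16()
model.eval()

act_grads = []

x = torch.randn(1, 3, 224, 224)
out = model.features(x)                            # activation after the feature extractor
out.register_hook(lambda g: act_grads.append(g))   # gradient w.r.t. this activation
out = model.avgpool(out)
out = out.view(out.size(0), -1)
out = model.classifier(out)
out.mean().backward()

print(act_grads[0].shape)                    # activation gradient, e.g. torch.Size([1, 512, 7, 7])
print(model.features[28].weight.grad.shape)  # weight gradient, torch.Size([512, 512, 3, 3])

The first tensor is the gradient with respect to the activation (shaped like the activation itself), while the second is the gradient with respect to the conv weight, so their shapes will never match.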

Sorry, I didn’t mean to confuse you. I made a mistake in my last post.

I’m using vgg16 for CIFAR10, and I chose the same layer to compare the gradients.

Yeah, as you said, there aren’t any weight or bias parameters in features[30]. So I chose features[37], which is a Conv2d layer in my model.

I defined m_o like this:
m_o = ModelOutputs(model, target_layers=['37'])
So I compared the gradient stored by ModelOutputs for layer ‘37’ with model.features[37].weight.grad.
The shapes are different, like I said in my last post:

m_o.gradients has shape (N, C_out, H_out, W_out).
Maybe that explains the difference in shape? Did I make myself clear? :grinning:

vgg16’s .features only goes up to index 30. Are you using vgg16_bn?

Maybe, because there is still some mixup between the model architectures?
As you can see in my dummy example, you’ll get the same gradients with the same shapes if you use the same model architecture.

Could you make sure to compare exactly the same models?
Try to print the modules using print(model) and also compare the number of parameters.
If you have a shape mismatch in a certain layer, it means that the model architectures are different.
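For example, a quick helper along these lines (describe is just a name I made up) would let you compare both models:

def describe(model):
    # print the architecture and the total parameter count
    print(model)
    print('num parameters:', sum(p.numel() for p in model.parameters()))

# call this on both models you are comparing, e.g. describe(net) and describe(model)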

Yes, I’m using vgg16_bn for CIFAR10:

odict_items([('features', Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace)
  (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (5): ReLU(inplace)
  (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (9): ReLU(inplace)
  (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (12): ReLU(inplace)
  (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (14): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (16): ReLU(inplace)
  (17): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (18): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (19): ReLU(inplace)
  (20): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (21): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (22): ReLU(inplace)
  (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (24): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (25): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (26): ReLU(inplace)
  (27): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (28): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (29): ReLU(inplace)
  (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (31): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (32): ReLU(inplace)
  (33): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (35): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (36): ReLU(inplace)
  (37): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (38): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (39): ReLU(inplace)
  (40): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (41): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (42): ReLU(inplace)
  (43): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (44): AvgPool2d(kernel_size=1, stride=1, padding=0)
)), ('classifier', Linear(in_features=512, out_features=10, bias=True))])

I just use the same model as mentioned above, and I compare features[37] of that same model using m_o.gradients and model.features[37].weight.grad.