Forward and backward passes in PyTorch

Hi, I want to ask about the difference between the following two pieces of code:

class ModelOutputs():
    """ Class for making a forward pass, and getting:
    1. The network output.
    2. Activations from intermediate targeted layers.
    3. Gradients from intermediate targeted layers. """

    def __init__(self, model, target_layers):
        self.model = model
        self.target_layers = target_layers
        self.gradients = None

    def save_gradient(self, grad):
        self.gradients = grad

    def __call__(self, x):
        conv_outputs = None
        for name, module in self.model.features._modules.items():
            x = module(x)
            if name in self.target_layers:
                conv_outputs = x
        output = x
        output = output.view(output.size(0), -1)
        output = self.model.classifier(output)
        return output, conv_outputs

I pass the following VGG as model to ModelOutputs():

class VGG(nn.Module):
    def __init__(self, vgg_name):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        self.classifier = nn.Linear(512, 10)
        self.gradients = None

    def save_gradient(self, grad):
        self.gradients = grad

    def forward(self, x):
        x = self.features(x)
        inter = x
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x, inter

I’m curious how these two pieces of code differ in the forward and backward passes. When I call loss.backward(), will the gradients be different?
Could someone help me?
Thanks for your help!

You could use some dummy data and target, calculate the loss and gradients, and compare the gradients of (some) layers using something like this:
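
A minimal sketch of such a comparison, assuming a small stand-in network instead of the full VGG (TinyNet and loop_forward are hypothetical names, not from the posted code). It runs the same dummy batch through a module-style forward and through a loop over the submodules, then checks whether all parameter gradients match:

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class TinyNet(nn.Module):
    """Hypothetical stand-in with the same features/classifier layout as the VGG above."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(8 * 4 * 4, 10)

    def forward(self, x):
        x = self.features(x)
        inter = x
        x = x.view(x.size(0), -1)
        return self.classifier(x), inter


def loop_forward(model, x):
    # Same computation, but each submodule of `features` is called in a loop.
    for name, module in model.features._modules.items():
        x = module(x)
    x = x.view(x.size(0), -1)
    return model.classifier(x)


model_a = TinyNet()
model_b = copy.deepcopy(model_a)  # identical weights, separate .grad buffers

# Dummy data and target
data = torch.randn(4, 3, 8, 8)
target = torch.randint(0, 10, (4,))

out_a, _ = model_a(data)
F.cross_entropy(out_a, target).backward()

out_b = loop_forward(model_b, data)
F.cross_entropy(out_b, target).backward()

# Compare the gradients parameter by parameter
grads_match = all(
    torch.allclose(pa.grad, pb.grad)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(grads_match)
```

Both paths apply the same operations in the same order, so the parameter gradients should agree up to floating point tolerance.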


I would recommend using the second approach, since the first one is not defined as an nn.Module, which will yield errors, e.g. if you try to call model.parameters() etc.
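
To illustrate the point about nn.Module, here is a small sketch with two hypothetical wrappers (PlainWrapper and ModuleWrapper are made-up names). Only the one that inherits from nn.Module registers the wrapped model and exposes parameters():

```python
import torch.nn as nn

net = nn.Linear(4, 2)


class PlainWrapper:
    """Plain Python class: the wrapped model is just an attribute."""

    def __init__(self, model):
        self.model = model

    def __call__(self, x):
        return self.model(x)


class ModuleWrapper(nn.Module):
    """nn.Module subclass: the wrapped model is registered as a submodule."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        return self.model(x)


plain = PlainWrapper(net)
wrapped = ModuleWrapper(net)

plain_has_parameters = hasattr(plain, "parameters")
num_wrapped_params = len(list(wrapped.parameters()))  # weight and bias of the Linear layer

print(plain_has_parameters)
print(num_wrapped_params)
```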


Thanks for your reply!

m_o = ModelOutputs(net, layer=40)
output, conv_output = m_o(input)
loss = cross_entropy(output, label)
grad_m_o = m_o.gradients
grad_net = net.features[40].weight.grad

I find that grad_m_o is different from grad_net. Why?

I also have another problem: I want to wrap the model in DataParallel and use multiple GPUs on one server, like this:

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
... ...
net = torch.nn.DataParallel(model)

And I use the first approach in the code above.

class ModelOutputs():

m_o = ModelOutputs(net, layer)
output, conv_output = m_o(input)

I passed net to ModelOutputs. When I ran the code, it didn’t use multiple GPUs, only GPU 0.
Is the problem caused by ModelOutputs not being defined as an nn.Module?

How large is the difference?
I’m not sure if the code you’ve posted is the one you are currently using, but it seems that you should pass target_layers as a list of strings.

I think you are not using multiple GPUs, since you are never calling the model itself (and thus its forward method), but split the call into submodules and call them in a loop.
Could you try to call net(input) and check the GPU utilizations?

Sorry for the late reply. Yes, I actually passed strings, like ['40'].

The shapes of m_o.gradients and net.features[40].weight.grad are different.

Yes, you’re right. I didn’t call net(input) directly. How can I use multiple GPUs with ModelOutputs? By inheriting from nn.Module and defining a forward function?

I’m not sure why the shapes differ, but apparently the wrong gradients are stored.
Here is a small dummy example using vgg16:

import torch
import torchvision.models as models

grads = []
def save_grad(grad):
    # Collect the gradient passed to the tensor hook
    grads.append(grad)

# Create model; eval() disables dropout so both passes are comparable
model = models.vgg16()
model.eval()

# First approach
x = torch.randn(1, 3, 224, 224)
output = model.features(x)
output.register_hook(lambda x: save_grad(x))
output = model.avgpool(output)
output = output.view(output.size(0), -1)
output = model.classifier(output)
output.mean().backward()

# Reset
model.zero_grad()

# Second approach
output = x.clone()
for name, module in model.features._modules.items():
    output = module(output)
    if '30' in name:
        output.register_hook(lambda x: save_grad(x))

output = model.avgpool(output)
output = output.view(output.size(0), -1)
output = model.classifier(output)
output.mean().backward()

# Compare gradients
print((grads[0] == grads[1]).all())
> tensor(1, dtype=torch.uint8)

I tried to stick to your approach and as you can see, both methods yield the same gradients.

I guess data parallel isn’t being used, since you are not calling the model directly, but each submodule.
This might be related to this issue.
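
One possible way to make the first approach work with DataParallel is to move the loop into the forward method of an nn.Module and call the wrapped module itself. The following is only a sketch under assumptions: ModelOutputsModule and TinyBackbone are hypothetical names, and the tiny backbone stands in for the VGG from the question.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for the VGG in the question."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(4 * 4 * 4, 10)


class ModelOutputsModule(nn.Module):
    """Hypothetical nn.Module variant of ModelOutputs: the loop lives in forward()."""

    def __init__(self, model, target_layers):
        super().__init__()
        self.model = model
        self.target_layers = target_layers  # list of layer names as strings

    def forward(self, x):
        conv_outputs = None
        for name, module in self.model.features.named_children():
            x = module(x)
            if name in self.target_layers:
                conv_outputs = x
        x = x.view(x.size(0), -1)
        return self.model.classifier(x), conv_outputs


device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = TinyBackbone().to(device)
wrapper = ModelOutputsModule(backbone, target_layers=["0"])

# DataParallel scatters the batch only when the wrapped module itself is called.
net = nn.DataParallel(wrapper)

x = torch.randn(6, 3, 8, 8, device=device)
output, conv_output = net(x)
print(output.shape, conv_output.shape)
```

On a machine without GPUs, DataParallel simply calls the wrapped module, so the sketch also runs on CPU. Whether several GPUs are actually used should still be checked by watching GPU utilization.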

Thanks for your quick reply!

Yes, with register_hook both methods yield the same gradients. Before this, I had tried:

I think the gradients stored by ModelOutputs(net, layer=40) are different from net.features[40].weight.grad.

That might be the case and I suspect you might be registering the weight parameter and overriding it with the bias parameter later, since both will match the condition.
Could you check the shape and see, if you are in fact storing the bias gradient?

Hi, @ptrblck. I have checked the shape. The ConvNd weight.shape is defined as:

self.weight = Parameter(torch.Tensor(
    out_channels, in_channels // groups, *kernel_size))

So the shape of features[40].weight.grad is the same as the weight’s shape:

(Cout, Cin // groups, *kernel_size)

The gradient stored by ModelOutputs(net, layer=40) has the shape of the layer output: (N, Cout, Hout, Wout)
I think this explains why the shapes are different. Am I right?
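
A short sketch to check this on a single convolution (the layer sizes are arbitrary assumptions, not the ones from the VGG above). A hook registered on the layer output receives the gradient with respect to the activation, while weight.grad holds the gradient with respect to the kernel:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(2, 3, 16, 16)

saved = {}
out = conv(x)
# Tensor hook: called with the gradient of the loss w.r.t. `out`
out.register_hook(lambda grad: saved.update(activation_grad=grad))
out.sum().backward()

print(saved["activation_grad"].shape)  # torch.Size([2, 8, 16, 16]) -> (N, Cout, Hout, Wout)
print(conv.weight.grad.shape)          # torch.Size([8, 3, 3, 3])   -> (Cout, Cin // groups, kH, kW)
```

So the two tensors are gradients of different quantities, and their shapes are not expected to match.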