How to split backward process wrt each layer of neural network?

zazzyy · September 8, 2017, 9:52pm

Hi everyone,

I’m working on a project that requires access to each step of backpropagation during training. Say I have a 10-layer fully connected neural net (input->fc1->fc2->…->fc10->output). During the backward pass I want something like output.backward()->fc10.backward()->fc9.backward()->…->fc1.backward() in separate steps, so that I can get the gradient of each layer and measure how long each layer’s gradient computation takes.

However, for now, when I call loss.backward() using this backward function (https://github.com/pytorch/pytorch/blob/master/torch/autograd/__init__.py#L46 ), only the loss variable is contained in variables and pushed into the execution engine.

How can I get access to the gradient computation of each layer’s parameters in the network? I really want something like

for layer in reversed(network):
    layer_grad = layer.backward()
    # here I can check the time cost of computing the gradient of a single layer
    layer.update(layer_grad)  # based on a certain optimizer

Any information will be appreciated.

smth · September 9, 2017, 1:00am

you can use backward hooks for:

getting access to the gradient of each of params
checking time taken for computing each layer

http://pytorch.org/docs/master/nn.html?highlight=hook#torch.nn.Module.register_backward_hook
http://pytorch.org/docs/master/autograd.html?highlight=hook#torch.autograd.Variable.register_hook
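
As a rough illustration of the first point, here is a minimal sketch of reading per-parameter gradients with tensor hooks. It assumes a recent PyTorch release in which parameters are plain tensors with a `register_hook` method; the model, its layer sizes, and the names `captured` and `make_hook` are made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

captured = {}  # parameter name -> gradient seen by the hook (hypothetical helper)

def make_hook(name):
    def hook(grad):
        # Autograd calls this as soon as the gradient for this tensor is ready.
        captured[name] = grad.detach().clone()
    return hook

for name, param in model.named_parameters():
    param.register_hook(make_hook(name))

loss = model(torch.randn(3, 4)).sum()
loss.backward()

for name, grad in captured.items():
    print(name, tuple(grad.shape))
```

The hooks only observe the backward pass; they do not let you pause it or skip layers.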

zazzyy · September 10, 2017, 12:00am

Hi @smth,

Thanks a lot for the pointers. I also have some related questions and hope you can provide some information.

My project involves solving straggler (slow worker) issues in a distributed cluster. So I want to “skip” the backward calculation (gradient computation) for some layers in the network.

Say I run the training code on a certain node in a cluster, and after the backward pass has reached a certain layer (layer_10.backward -> layer_9.backward -> layer_8.backward) I decide this node is too slow. At that point I want to skip the backward step of the remaining layers to avoid further time cost (e.g. by assigning a fixed value to the gradients of those layers rather than actually computing them).

Is this possible with the current PyTorch API, or do I need to customize modules (say, the convolutional layer)?

smth · September 10, 2017, 3:48pm

What you are asking for is not strictly possible without writing custom autograd.Function functions to insert these custom nodes. Either that, or you do some book-keeping and manage the graph yourself.

For example:

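model = nn.ModuleList([nn.Linear(100, 200), nn.ReLU(), nn.Linear(200, 300)])
x = Variable(torch.randn(10, 100), requires_grad=True)

def model_forward(model, x):
    for m in model:
        x = m(x)  # apply each sub-module in turn
        x.detach_()  # cut the autograd history after every layer
    return x

def model_backward(model, grad_output):
    for m in reversed(model):
        if TOO_LATE:  # placeholder for your own "this node is too slow" check
            return # shortcut outside of backward
        # pseudocode: nn.Module has no backward(); this is the book-keeping you do yourself
        grad_output = m.backward(grad_output)
    return grad_output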

Something of this order. It’s super hacky, and you have to do all model book-keeping yourself.

zazzyy · October 2, 2017, 3:48pm

Hi @smth,

I tried the method you provided, using the following code to define my customized module list (a simple LeNet example here) and the forward and backward operations:

class LeNetLayerSplit(nn.Module):
    def __init__(self):
        super(LeNetLayerSplit, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)
        self.ceriation = nn.CrossEntropyLoss()
        self.module_list_0 = nn.ModuleList([self.conv1, nn.MaxPool2d(2, stride=2), nn.ReLU(),
                                            self.conv2, nn.MaxPool2d(2, stride=2), nn.ReLU()])
        self.module_list_1 = nn.ModuleList([self.fc1, self.fc2])

        self._name = "LeNet_layer_split"

    def forward(self, x, target):
        for sub_module in self.module_list_0:
            x = sub_module(x)
            x.detach_()
        x = x.view(-1, 4*4*50)
        for sub_module in self.module_list_1:
            x = sub_module(x)
            x.detach_()
        loss = self.ceriation(x, target)
        return x, loss

    def backward(self, grad_output):
        for m in reversed(self.module_list_1):
            grad_output = m.backward(grad_output)
        grad_output.view(-1, 50, 4, 4)
        for n in reversed(self.module_list_0):
            grad_output = n.backward(grad_output)
        return grad_output

When calling this model, I used the following code:

def build_model(self):
        self.network = LeNetLayerSplit()
        # this is only used for test
        self.optimizer = torch.optim.SGD(self.network.parameters(), lr=self.lr, momentum=self.momentum)

def train(self, train_loader=None):
        self.network.train()
        # iterate of epochs
        for i in range(self.max_num_epochs):            
            for batch_idx, (data, y_batch) in enumerate(train_loader):
                iter_start_time = time.time()
                data, target = Variable(data, requires_grad=True), Variable(y_batch)
                self.optimizer.zero_grad()
                logits, loss = self.network(data, target)
                print("Trial Loss: {}".format(loss.data[0]))
 
                print("Start Backward Prop Process: ")
                loss.backward()

But I get the error RuntimeError: there are no graph nodes that require computing gradients. I guess I am calling the backward function in the wrong way, and a quick search returned no related issue.
When I read the source of the autograd Variable, I found that at this line https://github.com/pytorch/pytorch/blob/master/torch/autograd/variable.py#L235 results generated during the forward pass are set to requires_grad=False when detach_ is called. Is that what causes this issue? If so, how can I solve it? Please give me more details about this.

Thanks a lot!

smth · October 6, 2017, 3:40pm

@zazzyy if you call detach_() then there is no node in the graph that requires_grad=True, so autograd is complaining that it has no work to do.

What you might want to do is (maybe), instead of x.detach_(), call x = Variable(x.data, requires_grad=True) (or some form of this, that will compute gradients).
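
To make the difference concrete, here is a small sketch written against a recent PyTorch release, where `x.detach().requires_grad_()` plays the role of `Variable(x.data, requires_grad=True)`. The two layers and their sizes are arbitrary. It shows that a backward call stops at the detached leaf, so the gradient has to be handed to the earlier segment by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1 = nn.Linear(4, 4)
fc2 = nn.Linear(4, 2)

h = fc1(torch.randn(3, 4))

# Cut the graph here: h_leaf shares data with h but has no history,
# and requires_grad_() turns it into a leaf that autograd tracks.
h_leaf = h.detach().requires_grad_()

out = fc2(h_leaf)
out.sum().backward()

print(fc2.weight.grad is not None)  # True: fc2 sits after the cut
print(fc1.weight.grad)              # None: backward stopped at h_leaf

# Continue manually: feed the gradient at the cut into the earlier segment.
h.backward(h_leaf.grad)
print(fc1.weight.grad is not None)  # True
```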

zazzyy · October 9, 2017, 7:14pm

Hi @smth, thanks a lot for your response. Based on your suggestions, I tried the following:

I removed every x.detach_() in the foregoing code and simply called loss.backward(). The model then trains as it normally would, and my customized backward() function is not called.

def backward(self, grad_output):
    for m in reversed(self.module_list_1):
        grad_output = m.backward(grad_output)
    grad_output.view(-1, 50, 4, 4)
    for n in reversed(self.module_list_0):
        grad_output = n.backward(grad_output)
    return grad_output

When I tried to call the customized backward() function in this manner, self.network.backward(grad_output=${RANDOM_VARIABLE}), then I got this error:

    grad_output = m.backward(grad_output)
  File "/home/usr/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 262, in __getattr__
    type(self).__name__, name))
AttributeError: 'Linear' object has no attribute 'backward'

Does this mean I need to rewrite the corresponding nn.Module myself with a backward function, or is there a way for the current Module to do that for me? In general, I only want to run the backward process layer by layer: after getting the gradient of each layer, I want to do a simple check (e.g. on the value of a certain variable) to decide whether to continue or skip the remaining backward process.

smth · October 10, 2017, 3:41pm

in your backward function, you are calling .backward on m which is the Module. what you need to do is to call backward on the Variable that is the output from the module.

zazzyy · October 10, 2017, 4:01pm

Hi, @smth. Thanks a lot for this pointer.

I already tried what you suggest: calling .backward() on the output Variable of the module. The problem is, if I do the forward pass in the normal way like this:

    def forward(self, x, target):
        for sub_module in self.module_list_0:
            x = sub_module(x)
        x = x.view(-1, 4*4*50)
        for sub_module in self.module_list_1:
            x = sub_module(x)
        loss = self.ceriation(x, target)
        return x, loss

Then once I call variable.backward() on the last variable, say loss.backward(), the whole backward process is executed. But if I call something like x = Variable(x.data, requires_grad=True) after each forward step as you mentioned, then it seems no gradient is calculated when I check param.grad in module.parameters().
After checking the topic Assign manual assigned "grad_output", it seems calling variable.backward(grad_output) could help, but when I call loss.backward(grad_output) with the normal forward pass, the behavior is no different from loss.backward().
What tricks do I need in the forward/backward process to execute the backward process layer by layer (e.g. take the gradient produced by one layer and use it for the backward of the next, like applying the chain rule manually)?

Thanks a lot!

smth · October 11, 2017, 3:36am

here’s a more precise and fuller example. What you are doing in my example is to completely avoid autograd’s automatic backward computation and manually reverse-computing the backward graph.

For anyone coming here with a search, my solution is a hack, it is not good practice. it is given as an illustration just to showcase to @zazzyy how to shortcut these things

import torch
import torch.nn as nn
from torch.autograd import Variable

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.ModuleList([
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
        ])

    def forward(self, x):
        self.output = []
        self.input = []
        for layer in self.layers:
            # detach from previous history
            x = Variable(x.data, requires_grad=True)
            self.input.append(x)

            # compute output
            x = layer(x)

            # add to list of outputs
            self.output.append(x)
        return x

    def backward(self, g):
        for i, output in reversed(list(enumerate(self.output))):
            if i == (len(self.output) - 1):
                # for last node, use g
                output.backward(g)
            else:
                output.backward(self.input[i+1].grad.data)
                print(i, self.input[i+1].grad.data.sum())

model = Net()
inp = Variable(torch.randn(4, 10))
output = model(inp)
gradients = torch.randn(*output.size())
model.backward(gradients)
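
Building on the same idea, the following is a sketch of how the per-layer loop could stop early, which was the original goal of the thread. It is written against a recent PyTorch release and is not the exact code above: `SplitNet` and the `should_stop` callback are hypothetical names, and the stopping rule (`i < 2`) stands in for whatever "this node is too slow" check you use.

```python
import torch
import torch.nn as nn

class SplitNet(nn.Module):
    def __init__(self, num_layers=4, width=10):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(width, width) for _ in range(num_layers)])

    def forward(self, x):
        self.inputs, self.outputs = [], []
        for layer in self.layers:
            x = x.detach().requires_grad_()  # cut the graph before each layer
            self.inputs.append(x)
            x = layer(x)
            self.outputs.append(x)
        return x

    def backward(self, grad, should_stop=lambda i: False):
        """Run backward one layer at a time, from the last layer to the first.

        should_stop(i) is a hypothetical callback checked before layer i;
        returning True skips layer i and every layer before it.
        Returns the indices of the layers that were actually processed.
        """
        done = []
        for i in reversed(range(len(self.layers))):
            if should_stop(i):
                break
            self.outputs[i].backward(grad)
            grad = self.inputs[i].grad  # gradient to hand to layer i - 1
            done.append(i)
        return done

torch.manual_seed(0)
model = SplitNet()
out = model(torch.randn(4, 10))
done = model.backward(torch.ones_like(out), should_stop=lambda i: i < 2)
print(done)  # [3, 2]
for i, layer in enumerate(model.layers):
    print(i, layer.weight.grad is not None)
```

Layers that were skipped keep `grad = None`; if you want a substitute gradient for them, you would have to assign it yourself before the optimizer step.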

zazzyy · October 13, 2017, 12:37am

Hi, @smth. I deeply appreciate this working snippet, it is basically what I asked.

I have one further question about this. It seems the grad_output we provide in this line

gradients = torch.randn(*output.size())

is never used to compute anything during the backward process: when I tried to print every self.output[i+1].grad.data, they were all equal to the random gradient generated by torch.randn(*output.size()). And everything in self.output has grad_fn=None. That makes sense because, as you mentioned, this hack “completely avoid autograd’s automatic backward computation and manually reverse-computing the backward graph”.

My question is: if I want to make this net work “normally”, do I need to handle the entire backward process manually (e.g. write backward functions for all layers, manually compute gradients, etc.)? Is there any way to make this less hacky, or any function (e.g. the backward function of the Conv layer) to borrow?

smth · October 14, 2017, 4:21am

I’m sorry the example had a bug. I’ve fixed it now (see my example again) and it correctly computes gradients.

samarth-robo · September 29, 2018, 5:40pm

@smth can you use hooks to measure the time taken for the backward pass of a module?

Smart_Zhang · June 2, 2019, 9:09am

Are there any examples of measuring the backward propagation time of a layer by using hooks?
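
One possible sketch, assuming a PyTorch version that provides `nn.Module.register_full_backward_hook` (1.8 or later) and CPU execution: each hook records a timestamp when the gradient with respect to that module's input is ready, and the gap between consecutive timestamps approximates the time spent in that module's backward. The model, `events`, and `make_hook` are made up for the example. On a GPU the kernels run asynchronously, so you would need to synchronize before reading the clock for these numbers to mean anything.

```python
import time
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

events = []  # (child name, timestamp) in the order the hooks fire

def make_hook(name):
    def hook(module, grad_input, grad_output):
        # Fires once the gradient w.r.t. this module's input has been computed.
        events.append((name, time.perf_counter()))
    return hook

for name, module in model.named_children():
    module.register_full_backward_hook(make_hook(name))

# requires_grad=True so that the first layer also has an input gradient to report
x = torch.randn(64, 256, requires_grad=True)
loss = model(x).sum()

start = time.perf_counter()
loss.backward()

prev = start
for name, t in events:
    print(f"module {name}: {1e3 * (t - prev):.3f} ms")
    prev = t
```

These are wall-clock gaps between hook calls, so they include hook overhead and are only a rough estimate.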

maralm · September 20, 2019, 10:17pm

May I know why it is not good practice? Are there other ways of splitting the layers?
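
For reference, one alternative that avoids detaching is `torch.autograd.grad`, which can differentiate an intermediate output with respect to an intermediate input. The sketch below assumes a recent PyTorch release; the layer sizes and the names `acts` and `param_grads` are made up. It keeps the ordinary graph intact and walks it one layer at a time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

# Ordinary forward pass; keep every activation so it can serve as a cut point.
acts = [torch.randn(4, 10, requires_grad=True)]
for layer in layers:
    acts.append(layer(acts[-1]))

grad = torch.ones_like(acts[-1])  # gradient of the loss w.r.t. the final output
param_grads = {}                  # layer index -> (weight grad, bias grad)
for i in reversed(range(len(layers))):
    params = list(layers[i].parameters())
    # Differentiate layer i only: from its output back to its input and parameters.
    grads = torch.autograd.grad(
        acts[i + 1], [acts[i]] + params, grad_outputs=grad, retain_graph=True
    )
    grad, param_grads[i] = grads[0], grads[1:]
    # A timing check or an early `break` could go here.

print({i: tuple(g[0].shape) for i, g in param_grads.items()})
```

Note that `torch.autograd.grad` returns the gradients instead of writing them to `.grad`, so you would have to copy them into the parameters yourself before an optimizer step.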