Does the PyTorch detach() API remove a layer from further computation in the forward pass?

I have the following code.

self.bn1 = nn.BatchNorm2d(2)
self.conv1 = nn.Conv2d(2, 20, kernel_size=1)

self.bn2 = nn.BatchNorm2d(20)
self.conv2 = nn.Conv2d(20, 40, kernel_size=3)

self.active = True

def forward(self, x):

    if not self.active:
        self.eval()
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))

        out.detach()

Does out.detach() mean that, for the next iteration, the forward pass will not be computed?
I know detach will prevent the back-propagation computation.

No. detach() is a function on Variables, not nn.Modules (layers). It returns a new Variable that does not back-propagate to whatever detach() was called on.

So if you do:

x = out.detach()

You can use x in any differentiable computation and it will not back-propagate to out. Note that the call to detach() does not change out.

In the code you posted, the result of out.detach() is not stored or used, so it has no effect.
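For illustration, here is a minimal sketch of that behaviour, using the current tensor API (Variables and tensors have since been merged) and a made-up tensor a:

import torch

a = torch.randn(3, requires_grad=True)
out = a * 2                  # `out` is part of the autograd graph
x = out.detach()             # new tensor sharing data, but with no graph history

loss = (out + x).sum()       # use the detached copy in a differentiable computation
loss.backward()

print(out.requires_grad)     # True  -> detach() did not change `out`
print(x.requires_grad)       # False -> nothing back-propagates through `x`
print(a.grad)                # tensor([2., 2., 2.]): only the path through `out` contributed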

For info, there is an in-place version of detach, out.detach_(), that tells PyTorch not to try to backpropagate further.
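And a minimal sketch of the in-place version (same made-up tensor), which cuts the tensor itself out of the graph instead of returning a copy:

import torch

a = torch.randn(3, requires_grad=True)
out = a * 2
out.detach_()                # in-place: `out` itself no longer back-propagates
print(out.requires_grad)     # False
print(out.grad_fn)           # None -> there is nothing to backpropagate through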

Thank you, I modified the code to use the in-place version of detach.
I pair each detached output with its input and store them in a list.
Next time, when the same input is presented, I return the stored detached output, since the output from this layer should be the same, given that there is no back-propagation.

But with this approach my error rate remains quite high (98%) irrespective of the number of epochs or training size, whereas if I compute the output each time and then perform detach, the error rate goes down gradually.

My question is: if calling detach prevents back-propagation and the parameters of the layer remain the same, then presenting the same input again should produce the same output/detached output, which should match the stored one for that input. But it seems it does not. Is anything wrong with this approach?

The modified code is

def forward(self, x):

    if not self.active:
        self.eval()
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))

        out.detach_()

        # Store this detach in a list, matched with the input. Next time, if there
        # is a detach corresponding to the input, return that; else compute the
        # output/detach as done here and return it.
        return out

I really do not understand what you are trying to achieve. Could you post the entire forward function and any code that might “reuse the stored detach” as you put it?

For info, there is a button for formatting code. It looks a little like this [ </> ].

Thank you for your quick response. I am trying to save the computation time of a layer which has been detached from back-propagation.

The original code is from the paper
FreezeOut: Accelerate Training by Progressively Freezing Layers (https://arxiv.org/abs/1706.04983)

My modified code is given below

class Bottleneck(nn.Module):
    def __init__(self, nChannels, growthRate,layer_index, train_size, test_size, batch_sz):
        super(Bottleneck, self).__init__()
        interChannels = 4*growthRate
        self.bn1 = nn.BatchNorm2d(nChannels)
        self.conv1 = nn.Conv2d(nChannels, interChannels, kernel_size=1,
                               bias=False)
        self.bn2 = nn.BatchNorm2d(interChannels)
        self.conv2 = nn.Conv2d(interChannels, growthRate, kernel_size=3,
                               padding=1, bias=False)

        # If the layer is still being trained
        self.active=True
        
        # The index of this layer relative to the overall net
        self.layer_index=layer_index
        self.save_out = False

        # Counter to keep track of which batch is being processed
        self.counter = 0
        self.train_size = train_size
        self.test_size = test_size
        self.batch_sz = batch_sz

        remainder = train_size % batch_sz

        self.maxCounter = train_size // batch_sz
        if remainder != 0:
            self.maxCounter += 1

        # List to hold the output/detach of inactive layers
        self.out = [None] * self.maxCounter

        
    def forward(self, x):
    
        
        # If we're not active, return a detached output to prevent backprop.
        if self.active:
            out = self.conv1(F.relu(self.bn1(x)))
            out = self.conv2(F.relu(self.bn2(out)))
            out = torch.cat((x, out), 1)
            self.counter += 1
            if self.counter >= self.maxCounter:
                self.counter = 0
            return out

        else:
            # Layer has become inactive; check if we already have the output/detach
            # for the current input, and if not, compute and store it.
            if self.out[self.counter] is not None:
                detach = self.out[self.counter]
                self.counter += 1
                if self.counter >= self.maxCounter:
                    self.counter = 0

                detach.volatile = False
                return detach

            else:
                out = self.conv1(F.relu(self.bn1(x)))
                out = self.conv2(F.relu(self.bn2(out)))
                out = torch.cat((x, out), 1)
                # In-place detach
                out.detach_()

                self.out[self.counter] = out
                self.counter += 1
                if self.counter >= self.maxCounter:
                    self.counter = 0

                return out

If I understand this correctly: if self.active is True, the model should produce an output and allow backpropagation as usual.
If self.active is False, the model should run only once, detach the output from the computation graph and store it for future use.

If that is what you want then your code is fine.

One remark though, the detached output from this line

detach = self.out[self.counter]

ought to be called detached, not detach, because detach is a verb, not a noun, and won’t convey any useful meaning to anyone with good English. That is partly why I was confused by your question and needed to see your code.

Thank you for your quick response, and sorry for the confusion caused by using the verb.

But I am not getting the expected result.
My question is: when we detach the output from the computation graph, the parameters of that layer are not updated during back-propagation. So it should not matter whether I compute the output again and call detach on it to prevent back-propagation, or return the stored detached output to be used by the next layer.

If I do not use the code

if self.out[self.counter] is not None:
    detach = self.out[self.counter]
    self.counter += 1
    if self.counter >= self.maxCounter:
        self.counter = 0

    detach.volatile = False
    return detach

and instead compute the output each time when self.active is False, as given below, I get a different result, and it is the expected one.

out = self.conv1(F.relu(self.bn1(x)))
out = self.conv2(F.relu(self.bn2(out)))
out = torch.cat((x, out), 1)
# In-place detach
out.detach_()

return out

My understanding was that once the output is detached from back-propagation, the outputs from both of the code snippets above would be the same. But it seems they are not.

My main purpose is to save the output computation time of a layer if I am not training that layer (i.e. it has been removed from back-propagation).

I can’t see any obvious errors in your code.

The batchnorm layers keep running estimates of mean and variance, and those stats may vary from one epoch to the next. I can’t see any other reason why the computed output should differ from the saved output.
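For illustration, a minimal sketch (made-up shapes and data) showing that a BatchNorm2d module updates its running estimates on every forward pass while in train mode, so an output evaluated later against those moved statistics need not match one that was cached earlier:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(2)
x = torch.randn(4, 2, 8, 8)

bn.train()
before = bn.running_mean.clone()
y_cached = bn(x)                             # forward in train mode updates running_mean / running_var
print(torch.equal(before, bn.running_mean))  # False: the running estimates moved

_ = bn(torch.randn(4, 2, 8, 8))              # further training batches move them again

bn.eval()
y_eval = bn(x)                               # eval mode normalises with the running stats
print(torch.allclose(y_cached, y_eval))      # almost certainly False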