Beginner: Should ReLU/sigmoid be called in the __init__ method?

mohit117 · May 21, 2020, 8:29am

Oh, now I am a bit confused. I have explained my use case in the Appendix, but I am now more worried about my understanding of PyTorch than about this particular case. Kindly guide me on the following general question; if I get this point right, I can manage the exact implementation of the paper described in the Appendix myself.

In the forward definition of a network model, say there is some activation ‘x’:

def forward(self, x):
    ...
    x = x + 1  # or, say, torch.log(x), or 5 * x
    ...

I am sure backpropagation will happen through x in this case without trouble.

Then why should it not also happen through the weights and bias in the following case?

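import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        conv = nn.Conv2d(1, 6, 3)
        self.weight = nn.Parameter(conv.weight)
        self.bias = nn.Parameter(conv.bias)

    def modify_conv_weights(self, weights, bias):
        weights = weights + 1
        bias = 15 + torch.log(bias)
        return weights, bias

    def forward(self, x):
        x = F.conv2d(x, self.weight, self.bias)
        # Recreate the parameters from the modified values for the next forward pass;
        # assigning a plain tensor to a registered parameter raises a TypeError
        weight, bias = self.modify_conv_weights(self.weight, self.bias)
        self.weight, self.bias = nn.Parameter(weight), nn.Parameter(bias)
        return x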

My understanding is that PyTorch should be able to backpropagate easily in this case as well.
Yours sincerely

APPENDIX
Kindly ignore the exact details mentioned in this paper. I am confident I can implement the new layer defined in it once I understand the above issue correctly. In this paper the authors propose a new conv layer in which, after each forward pass, the weights are modified so that the central weight is always positive, all other weights are negative, and all weights sum to 0.

ptrblck · May 22, 2020, 7:06am

The difference is that the first case uses the activations to calculate the gradient of the loss w.r.t. the weight.
In the second approach you are recreating the parameter, so the original weight tensor, which was used to calculate the output, doesn’t exist anymore. The new weight parameter was never used in the calculation of the current output.
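A minimal sketch of this point, assuming a hypothetical stand-in weight `w` and a toy loss (neither is from the thread): ordinary operations keep a tensor in the autograd graph, while wrapping the result in `nn.Parameter` produces a new leaf tensor with no `grad_fn`, so nothing links it back to the tensor that produced the output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a layer weight (not from the original post).
w = nn.Parameter(torch.randn(3))
x = torch.ones(3)

# Case 1: ordinary operations on a tensor stay in the autograd graph.
y = 5 * (w + 1)
print(y.grad_fn is not None)

# Case 2: wrapping the result in nn.Parameter creates a fresh leaf tensor.
w_new = nn.Parameter(w + 1)
print(w_new.is_leaf, w_new.grad_fn)

# The loss is computed from w, so only w receives a gradient.
loss = (x * w).sum()
loss.backward()
print(w.grad)      # populated by backward()
print(w_new.grad)  # None: w_new took no part in computing the loss
```

The new parameter only enters a graph once a later forward pass actually uses it.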

I’m not familiar with the paper, but do you know when exactly the authors are manipulating the parameters?

mohit117 · May 22, 2020, 10:31am

In the second approach you are recreating the parameter, so the original weight tensor, which was used to calculate the output, doesn’t exist anymore. The new weight parameter was never used in the calculation of the current output.

Yes, I agree that the new weights are not used in the current activation. But in the very next forward pass the new weights will be used, and hence backpropagation would happen through them. Am I correct? In short, in every n-th forward pass the weights are modified, and the weights modified in the (n-1)-th forward pass are used. Hence, when backpropagation happens for the (n+1)-th forward pass, the weights updated in the n-th forward pass are used.

I hope I am correct and have not confused you too much.

but do you know when exactly the authors are manipulating the parameters?

The weight update happens after each forward pass.

Thank you
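One way to change weights between forward passes while keeping the same `nn.Parameter` object registered in the module and the optimizer is an in-place update under `torch.no_grad()`. The sketch below assumes a hypothetical constraint (subtracting each kernel's mean so that it sums to zero) and a hypothetical class name `ConstrainedConv`; it is not the paper's rule, only an illustration of the mechanism.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ConstrainedConv(nn.Module):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 6, 3, bias=False)

    @torch.no_grad()
    def apply_constraint(self):
        # Hypothetical constraint: make every 3x3 kernel sum to zero.
        # The in-place sub_ edits the existing parameter instead of replacing it.
        w = self.conv.weight
        w.sub_(w.mean(dim=(2, 3), keepdim=True))

    def forward(self, x):
        return self.conv(x)

model = ConstrainedConv()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
param_id = id(model.conv.weight)

for _ in range(3):
    optimizer.zero_grad()
    out = model(torch.randn(2, 1, 8, 8))
    loss = out.pow(2).mean()
    loss.backward()
    optimizer.step()
    model.apply_constraint()  # the modified weights are used by the next forward pass

print(id(model.conv.weight) == param_id)
print(model.conv.weight.sum(dim=(2, 3)).abs().max())
```

Because the parameter object never changes, each backward pass computes gradients for exactly the weights that the forward pass used.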

saba · July 29, 2020, 11:12pm

Hi ptrblck,

I need to apply the sigmoid function element-wise where the input is <= 1 (and output 1 elsewhere). Is this code correct?

import torch
import torch.nn as nn

class Rectifier(nn.Module):
    def __init__(self):
        super(Rectifier, self).__init__()

    def forward(self, input):
        Out = torch.zeros(input.shape[0], input.shape[1])
        for ii in range(input.shape[0]):
            for ii1 in range(input.shape[1]):
                if input[ii, ii1] <= 1:
                    Out[ii, ii1] = 1 / (1 + torch.exp(-1 * input[ii, ii1]))
                elif input[ii, ii1] > 1:
                    Out[ii, ii1] = 1

        return Out
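The same element-wise rule can probably be written without Python loops. This is a sketch using `torch.where`, assuming the intended behavior is sigmoid for inputs <= 1 and a constant 1 elsewhere, as in the loop above; the sample tensor `x` is made up for illustration.

```python
import torch
import torch.nn as nn

class Rectifier(nn.Module):
    def forward(self, input):
        # sigmoid where input <= 1, constant 1 elsewhere;
        # ones_like keeps the dtype and device of the input
        return torch.where(input <= 1, torch.sigmoid(input), torch.ones_like(input))

# Made-up sample input to exercise both branches.
x = torch.tensor([[-2.0, 0.0, 1.0], [1.5, 3.0, 0.5]], requires_grad=True)
out = Rectifier()(x)
out.sum().backward()
print(out)
print(x.grad)
```

This version works for inputs of any shape, and elements on the constant branch receive a zero gradient.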

BarCodeReader · September 11, 2020, 2:36am

@ptrblck @mohit117 I am confused after reading through all of this.

So how about the simple case below:

import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv = nn.Conv2d(1, 6, kernel_size=3, padding=1, bias=False)

    def double_weight(self, weights):
        # Out-of-place multiply: an in-place `weights *= 2` on a leaf parameter
        # that requires grad raises a RuntimeError
        return weights * 2

    def forward(self, x):
        out1 = self.conv(x)

        weight = self.double_weight(self.conv.weight)  # weight is still part of the autograd graph
        out2 = F.conv2d(x, weight, padding=1)

        return out1 + out2

In the code above we have 2 branches: branch 1 is a normal convolution, while in each forward pass branch 2 copies branch 1’s weight, multiplies it by 2, and convolves. The two outputs are then summed and returned.

For the backward pass, I would expect the gradient update for branch 1 (G1’) to be G1’ = G1 + G2, where G1 comes from self.conv and G2 from F.conv2d, since branch 2 copies branch 1’s weight. The relationship might not be exactly G1 + G2, but G2 will also contribute to branch 1’s gradient update, right?

If I declare the weight as an nn.Parameter

weight = nn.Parameter(self.double_weight(self.conv.weight))

weight is still attached to my computation graph, but it is no longer connected to self.conv, and thus in the backward pass branch 2 will not contribute to branch 1.

Is my understanding correct?
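The G1 + G2 expectation can be checked directly on the gradients. The sketch below assumes a toy loss that is just the sum of the output, which is linear, so the decomposition is exact; with a nonlinear loss each branch's gradient would depend on the combined output. The input `x` and the variable names are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv2d(1, 6, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, 1, 5, 5)

# G1: gradient from branch 1 alone.
conv(x).sum().backward()
g1 = conv.weight.grad.clone()
conv.weight.grad = None

# G2: gradient from branch 2 alone (2 * weight stays in the graph).
F.conv2d(x, conv.weight * 2, padding=1).sum().backward()
g2 = conv.weight.grad.clone()
conv.weight.grad = None

# Both branches summed, as in forward() above.
(conv(x) + F.conv2d(x, conv.weight * 2, padding=1)).sum().backward()
g_total = conv.weight.grad.clone()

print(torch.allclose(g_total, g1 + g2, atol=1e-5))
```

Autograd accumulates the contributions of every path that reaches the same leaf tensor, which is what the comparison checks.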

mohit117 · September 11, 2020, 4:30am

Hello @BarCodeReader ,
I completely agree with what you have written in bold.

Regarding the rest: theoretically, I think they should contribute to branch 1.

By the theory of deep learning, when the loss is calculated on out1+out2, backpropagation traces all the parameters that contributed to the final activation. I hope you agree with me that, whether nn.Parameter is used or not, conv.weight definitely affects out2 (which is part of the final activation). Thus, in either case, backpropagation should in theory modify self.conv.weight.

And I am quite confident that PyTorch does that, for I have had a very good experience with PyTorch and find it relatively free from bugs.

BarCodeReader · September 11, 2020, 7:55am

Hi @mohit117
Thanks for your reply!

I did a simple test, and it seems that after declaring weight = nn.Parameter(self.double_weight(self.conv.weight)),
the weight is no longer connected to self.conv.weight and does not contribute to self.conv anymore.

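import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class Test1(nn.Module):
    def __init__(self, seed):
        super(Test1, self).__init__()
        # fix the seed
        random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # init conv
        self.conv = nn.Conv2d(1, 3, kernel_size=3, padding=1, bias=False)

    def double_weight(self, weights):
        # out-of-place multiply; an in-place `weights *= 2` on a leaf parameter
        # that requires grad raises a RuntimeError
        return weights * 2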
        
    def forward(self, x):
        out1 = self.conv(x)

        weight = nn.Parameter(self.double_weight(self.conv.weight))
        out2 = F.conv2d(x, weight, padding=1)

        return out1+out2

We can run another test as validation (I omit the identical code and only show the difference here):

class Test2(nn.Module):
    def __init__(self, seed):
        ...

    def double_weight(self, weights):
        ...
        
    def forward(self, x):
        out1 = self.conv(x)

        weight = self.double_weight(self.conv.weight).detach()
        out2 = F.conv2d(x, weight, padding=1).detach()

        return out1+out2

In Test2, since we detach the weight, it of course does not contribute to the gradient update of self.conv. So if the gradient update of self.conv in Test1 is the same as in Test2, the nn.Parameter() is no longer connected to self.conv; if the update in Test1 differs from Test2, the nn.Parameter() is still contributing to self.conv.

You can then run a very simple forward and backward pass and print the conv weight before and after.

t1 = Test1(seed=123)
t2 = Test2(seed=123)

print(t1.conv.weight)
>>[[[[-0.1359, 0.0110, -0.1656], ...]]]

print(t2.conv.weight)
>>[[[[-0.1359, 0.0110, -0.1656], ...]]]

after 1 forward and backward pass:

optimizer SGD, lr=0.1, momentum=0.9, weight_decay=1e-4
loss: MSE
input: torch.ones([1,1,3,3])
gt: torch.zeros([1,3,3,3])

Let's print the weights after the forward and backward pass:

print(t1.conv.weight)
>>[[[[1.1843, 1.8497, 1.0289], ...]]]

print(t2.conv.weight)
>>[[[[1.1843, 1.8497, 1.0289], ...]]]

They are still the same, which means the nn.Parameter is not connected to self.conv anymore.

If you don’t mind, please help me verify this, and thanks again for your reply!
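A possibly more direct check is to compare `conv.weight.grad` itself rather than the weights after an optimizer step. The sketch below is not the poster's script: the helper `conv_grad` and its `mode` argument are hypothetical, and it reuses only the input, target, and MSE loss listed above. It computes the gradient for three variants of branch 2: weight left in the graph, weight wrapped in `nn.Parameter`, and weight detached.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_grad(mode, seed=123):
    """Return (gradient of conv.weight, branch-2 weight) for one variant."""
    torch.manual_seed(seed)
    conv = nn.Conv2d(1, 3, kernel_size=3, padding=1, bias=False)
    x = torch.ones(1, 1, 3, 3)
    target = torch.zeros(1, 3, 3, 3)

    doubled = conv.weight * 2
    if mode == "shared":
        weight = doubled                # still connected to conv.weight
    elif mode == "wrapped":
        weight = nn.Parameter(doubled)  # new leaf tensor
    else:
        weight = doubled.detach()       # constant, no gradient

    out = conv(x) + F.conv2d(x, weight, padding=1)
    F.mse_loss(out, target).backward()
    return conv.weight.grad.clone(), weight

g_shared, _ = conv_grad("shared")
g_wrapped, w_wrapped = conv_grad("wrapped")
g_detached, _ = conv_grad("detached")

print(torch.allclose(g_wrapped, g_detached))  # wrapped behaves like detached for conv.weight
print(torch.allclose(g_shared, g_wrapped))    # the shared variant differs
print(w_wrapped.grad is not None)             # the new parameter collects its own gradient
```

The wrapped weight does receive a gradient, but as a separate leaf that is not registered in the module, so an optimizer built from the module's parameters never updates it.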

mohit117 · September 11, 2020, 8:50am

Hello @BarCodeReader
Your results match the theory but, if you allow me, they cannot be used to support your conclusion.

For example, suppose a CNN has 5 layers back to back, i.e. the output of the first layer is fed to the second layer, and so on.

Now suppose in PyTorch you set requires_grad of, say, layer 3 to False for exp1 and True for exp2.

Given that everything else is the same, if you print the weights of layer 1 after the first backprop, they will be exactly the same!

This is because detaching layer 3 prevents the update of its own parameters, not those of layer 1. For layer 1, exactly the same backpropagation mechanism happens in both cases.

Returning to your experiment: since you print conv.weight, I would have expected the same weights in both Test1 and Test2, as your experiment confirms.

Perhaps define weight in the __init__ part and then re-run the experiment, but this time also print weight. Please modify the experiments along these lines; that should shed more light.

Thank you

100deep1001 · August 19, 2021, 2:19pm

@ptrblck you are unbelievable!

One of the reasons to switch to PyTorch is surely because of people like you!

mfa · January 27, 2022, 8:51am

@ptrblck I was wondering: if you had multiple conv layers, would you use the same self.act on the output of all those layers, or would you create one activation per conv layer?

ptrblck · January 27, 2022, 9:39am

I think it depends a bit on the use case and if and how you would like to change the model architecture in the future.
E.g. defining a single activation and reusing it would allow you to simply replace it with another activation method to run more experiments (e.g. would a leaky relu improve the performance compared to a

mfa · January 27, 2022, 10:11am

@ptrblck thanks for your answer. Assuming you know that every layer is better off with e.g. relu, using the same instance of a relu activation layer for different conv layers is not an issue, right? E.g. self.act2 = nn.ReLU() in the approach below is redundant, right?

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 3, 1, 1)
        self.conv2 = nn.Conv2d(6, 6, 3, 1, 1)  # in_channels must match conv1's 6 output channels
        self.act1 = nn.ReLU()
        self.act2 = nn.ReLU()
        
    def forward(self, x):
        x = self.act1(self.conv1(x))
        x = self.act2(self.conv2(x))
        return x

ptrblck · January 27, 2022, 5:31pm

Yes, you could reuse self.act1 in the second call.
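A short sketch of the reuse, assuming a hypothetical `SharedAct` model with `conv2` taking 6 input channels to match `conv1`'s output: `nn.ReLU` holds no parameters or buffers, so one instance can serve every layer and nothing about it ends up in the `state_dict`.

```python
import torch
import torch.nn as nn

class SharedAct(nn.Module):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 3, 1, 1)
        self.conv2 = nn.Conv2d(6, 6, 3, 1, 1)  # in_channels chosen to match conv1's output
        self.act = nn.ReLU()  # stateless, so a single instance is reused

    def forward(self, x):
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        return x

model = SharedAct()
out = model(torch.randn(1, 3, 8, 8))

print(list(model.act.parameters()))      # the activation has no parameters
print(list(model.state_dict().keys()))   # only conv weights and biases
print(out.shape)
```

This holds for stateless activations only; a module with learnable state, such as `nn.PReLU`, would share that state across every place it is reused.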
