Why don't these variables get updated?

Hello all, I created a simple network where a convolutional layer's weight matrix is altered by a custom function.
I came up with this:


import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms


class snet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 2, 1, 0)
        shape = self.conv1.weight.shape
        self.var1 = nn.Parameter(torch.ones(shape))
        self.var2 = nn.Parameter(torch.ones(shape))

        self.conv2 = nn.Conv2d(6, 6, 5, 1, 0)
        self.fc = nn.Linear(6*11*11, num_classes)

    def some_method(self):
        """
            Suppose this is a custom method, tasked with
            producing values for each entry in the weight matrix
            of a convolutional layer. For simplicity we use
            addition here.
        """
        return self.var1 + self.var2

    def forward(self, input):
        self.conv1.weight = nn.Parameter(self.some_method())
        output = self.conv1(input)
        # using the functional API with the weight matrix directly is no different:
        # output = F.conv2d(input, self.some_method())
        output = self.conv2(output)
        output = output.view(input.size(0), -1)
        output = self.fc(output)
        return output


n = snet(num_classes=3)
fake_dataset = torchvision.datasets.FakeData(100,
                                             image_size=(3, 16, 16),
                                             num_classes=3,
                                             transform=transforms.ToTensor())
fake_dataloader = torch.utils.data.DataLoader(fake_dataset,
                                              batch_size=20)
criterion = nn.CrossEntropyLoss()


opt = torch.optim.Adam(n.parameters(), lr=0.01)
for imgs, labels in fake_dataloader:
    p = n(imgs)
    loss = criterion(p, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(loss.item())

Apparently this is wrong, as nothing happens! The parameters are added to the module and they show up in the parameters list; however, the gradient is always zero!
I noticed the grad_fn property for both variables/parameters is None, whereas it should have been the addition, right?
Based on the autograd tutorial, when one variable in an operation has requires_grad = True, the output will also have requires_grad = True, and thus the gradient should flow back to those with requires_grad set to True.
Since nn.Parameter() sets this property to True implicitly, this should work, yet it does not!
What's wrong here? What am I missing?
Any help is greatly appreciated.
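
For reference, this is roughly how I'm checking things (just a minimal snippet reusing the objects defined above):

# quick check after a single backward pass, reusing the model/dataloader above
imgs, labels = next(iter(fake_dataloader))
loss = criterion(n(imgs), labels)
loss.backward()
print(n.var1.grad)             # no gradient ever flows back here
print(n.var1.grad_fn)          # None
print(n.conv1.weight.grad_fn)  # None as well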

var1 and var2 have no grad_fn because you are not performing any operation that changes their values.

I believe the conv1 weight doesn't have a grad_fn because some_method() (or any Python function) returns the value rather than a reference. So although the value gets updated, the lack of a reference means PyTorch can't know what changed the value.
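
A quick standalone sketch (my own toy example, using the same assignment pattern as your forward) that shows the effect:

import torch
import torch.nn as nn

var1 = nn.Parameter(torch.ones(6, 3, 2, 2))
var2 = nn.Parameter(torch.ones(6, 3, 2, 2))

summed = var1 + var2
print(summed.grad_fn)       # AddBackward..., still connected to var1/var2

conv = nn.Conv2d(3, 6, 2)
conv.weight = nn.Parameter(summed)   # same assignment as in your forward
print(conv.weight.grad_fn)           # None: the new Parameter is a fresh leaf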


Thanks a lot for your response, but I thought that since they are being used in an operation, they must have a gradient in any case, right?
If not, how am I supposed to introduce dependent variables?
For example, in my case, what should I be doing? I'm really lost here!

They do take part in an operation, but PyTorch doesn't know what operation it is, because it just gets the value from the function.
What you can try in this case is to perform the computation of some_method inside the __init__ or forward method.
I will try this out and let you know if it works.


Thanks a lot, it's greatly appreciated :slight_smile:


nn.Parameter doesn't seem to transfer any kind of grad information. One workaround that I could get to work was to use the torch.nn.functional API and pass the kernel through there.

def some_function(self):
    self.kernel = self.var1 + self.var2

def forward(self, x):
    self.some_function()
    output = F.conv2d(x, self.kernel, stride=1, padding=0)
    return output

Here grad_fn works as usual.
You do have to figure out the kernel shape yourself, though.
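
Putting it together, the whole module could look roughly like this (a sketch based on the shapes in your first post; the hard-coded kernel shape (6, 3, 2, 2) is the part you have to work out yourself):

import torch
import torch.nn as nn
import torch.nn.functional as F

class snet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # kernel shape for a Conv2d(3, 6, kernel_size=2): (out_ch, in_ch, kH, kW)
        shape = (6, 3, 2, 2)
        self.var1 = nn.Parameter(torch.ones(shape))
        self.var2 = nn.Parameter(torch.ones(shape))
        self.conv2 = nn.Conv2d(6, 6, 5, 1, 0)
        self.fc = nn.Linear(6 * 11 * 11, num_classes)

    def some_function(self):
        # build the kernel from the two trainable tensors
        self.kernel = self.var1 + self.var2

    def forward(self, x):
        self.some_function()
        output = F.conv2d(x, self.kernel, stride=1, padding=0)
        output = self.conv2(output)
        output = output.view(x.size(0), -1)
        return self.fc(output)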


Thanks, but this only acts as a one-time initialization; it's as if I wrote:

def __init__(self):
    super().__init__()
    self.conv1 = nn.Conv2d(3, 6, 3) 
    shape = self.conv1.weight.shape
    self.conv1.weight = nn.Parameter(torch.rand(shape)) + nn.Parameter(torch.rand(shape))

and use self.conv1 in the forward pass!
The difference is that here the values of self.kernel will get optimized, but what I'm after is for the values of self.var1 and self.var2 to get changed, so that their interaction results in a set of filters/kernels (the resulting weight matrix) that minimizes the loss.
In other words, in your self.kernel case there are n entries that get learned, but in my case there are 2x as many variables, which are used to create the kernel.
The method call is necessary because each time it is called in the forward pass, it runs the underlying operation that creates a new set of filters. These filters are then used, some loss is produced, and ultimately a set of gradients is created. I want these gradients to update not the filter values themselves, but the self.var1 and self.var2 values that create them.

I had a missing line of code above. It should be:

def forward(self, x):
    self.some_function()
...

The values of kernel won't be optimized directly; instead the optimizer will only optimize the values of var1 and var2, as they are the parameters.
You can check that out by tracing the grad_fn backwards (a tedious process, though).
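
For example, a rough way to start (assuming model is an instance of the F.conv2d version of the module, and reusing imgs, labels and criterion from the first post; a full trace would have to recurse over every branch):

# peek at the first couple of levels of the autograd graph behind the loss
loss = criterion(model(imgs), labels)
print(loss.grad_fn)
for parent, _ in loss.grad_fn.next_functions:
    print('  ', parent)
    if parent is not None:
        for grandparent, _ in parent.next_functions:
            print('    ', grandparent)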


I don’t clearly understand what you are saying. Can you explain it?

Thanks, that actually worked :slight_smile: Thanks a gazillion times, sir :slight_smile:
Also, I noticed there is no need to use a class attribute for this; simply returning the result like this also works:

    def some_method(self):
        """
            Suppose this is a custom method, tasked with
            producing values for each entry in the weight matrix
            of a convolutional layer. For simplicity we use
            addition here.
        """
        result = self.var1 + self.var2
        return result

    def forward(self, inputs):
        weight = self.some_method()
        output = F.conv2d(inputs, weight)

        print('var1 grad', self.var1.grad)
        print('weight grad', weight.grad)

        return output

When running this, self.var1.grad actually has values!
The catch seems to be that it only works when using the functional form! (Why is that? I'd really like to know why the first approach doesn't work.)

var1 grad None
weight grad None

var1 grad tensor([[[[ 0.0295,  0.0308],
          [ 0.0284,  0.0288]],

         [[ 0.0157,  0.0115],
          [ 0.0275,  0.0370]],

         [[ 0.0231,  0.0077],
          [ 0.0269,  0.0438]]],


        [[[ 0.0812,  0.0750],
          [ 0.0818,  0.0876]],

         [[ 0.0944,  0.0916],
          [ 0.0720,  0.0938]],

         [[ 0.0702,  0.0918],
          [ 0.0966,  0.0842]]],


        [[[-0.0011,  0.0065],
          [ 0.0024,  0.0097]],

         [[ 0.0100, -0.0002],
          [ 0.0133,  0.0061]],

         [[ 0.0084,  0.0118],
          [ 0.0030,  0.0073]]],


        [[[ 0.0204,  0.0375],
          [ 0.0376,  0.0324]],

         [[ 0.0149,  0.0122],
          [ 0.0230,  0.0258]],

         [[ 0.0214,  0.0505],
          [ 0.0280,  0.0071]]],


        [[[-0.0798, -0.0690],
          [-0.0813, -0.0650]],

         [[-0.0819, -0.0741],
          [-0.0805, -0.0698]],

         [[-0.0827, -0.0689],
          [-0.0654, -0.0692]]],


        [[[ 0.0201,  0.0045],
          [ 0.0195,  0.0160]],

         [[ 0.0128,  0.0298],
          [ 0.0173,  0.0195]],

         [[ 0.0271,  0.0173],
          [ 0.0270,  0.0049]]]])
weight grad None
var1 grad tensor([[[[ 0.0686,  0.0713],
          [ 0.0695,  0.0971]],

         [[ 0.0705,  0.0810],
          [ 0.0792,  0.0667]],

         [[ 0.0934,  0.0621],
          [ 0.0784,  0.0773]]],


        [[[ 0.4964,  0.4827],
          [ 0.4911,  0.4820]],

         [[ 0.5122,  0.5005],
          [ 0.5119,  0.4927]],

         [[ 0.4984,  0.5238],
          [ 0.4832,  0.5231]]],


        [[[-0.2394, -0.2491],
          [-0.2389, -0.2550]],

         [[-0.2355, -0.2570],
          [-0.2466, -0.2347]],

         [[-0.2434, -0.2351],
          [-0.2315, -0.2475]]],


        [[[ 0.1815,  0.1805],
          [ 0.2223,  0.2162]],

         [[ 0.2135,  0.2039],
          [ 0.1949,  0.2067]],

         [[ 0.2072,  0.2130],
          [ 0.2181,  0.1701]]],


        [[[-0.1945, -0.1760],
          [-0.2120, -0.2003]],

         [[-0.1972, -0.1877],
          [-0.2123, -0.1866]],

         [[-0.1686, -0.2197],
          [-0.1895, -0.2136]]],


        [[[ 0.0689,  0.0675],
          [ 0.0926,  0.0759]],

         [[ 0.0856,  0.0696],
          [ 0.0806,  0.0503]],

         [[ 0.0815,  0.0521],
          [ 0.0534,  0.0843]]]])
...

The weird thing is that although we have gradients, grad_fn is None for both self.var1 and self.var2.

And concerning:

I don’t clearly understand what you are saying. Can you explain it?

For the sake of simplicity, let's assume we have a kernel of size 2x2.
This simply means we have 4 variables (separate values) that together form the kernel we know and use.
In the forward pass, these 4 variables are used in different operations (multiplication, sum, etc.), and then in the backward pass each of them gets a gradient with respect to how it contributed to the final loss.
Now, when I said we have two variables for each entry, I simply meant: suppose we don't have this kernel of 2x2 (4 variables); instead, suppose we have an empty 2x2 grid and we need to fill its entries.
In the first case, a variable is allocated to each entry, so a 2x2 grid has 4 variables that each represent a single entry (the one they occupy). Now suppose we want to fill this grid not with one variable per entry but with two, meaning 2 variables are used to create the value of a single entry in our grid.
In other words, suppose we have a grid1 of 2x2 and a grid2 of 2x2, and we use these two grids (for example by adding them together element-wise) to get the values for our empty grid, i.e. the kernel values. Here the kernel itself is not treated as our variables; it is simply a matrix of values. The real variables are grid1 and grid2. I hope this makes it a bit clearer! Although I already got what I was after, thanks to you :slight_smile:
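
In toy code, what I mean is roughly this (just an illustration):

import torch
import torch.nn as nn

grid1 = nn.Parameter(torch.rand(2, 2))   # 4 trainable values
grid2 = nn.Parameter(torch.rand(2, 2))   # 4 more trainable values
kernel = grid1 + grid2                   # the 2x2 kernel is just a derived value

# 8 values get gradients; the 4 kernel entries themselves are not parameters
print(grid1.numel() + grid2.numel())         # 8
print(kernel.requires_grad, kernel.grad_fn)  # True, AddBackward...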

By the way, what puzzles me now is that we have gradients for our vars, yet their grad_fn attribute is None!?
Why is that?

As I said earlier, a variable has a grad_fn only when it is the result of an operation (unless the code is wrapped in with torch.no_grad():). Here no operation produces var1 or var2; they are leaves you created directly, so their grad_fn stays None.

To see this, you can run this code:

x = torch.ones(2).requires_grad_(True)
y = x + 2
print (y.grad_fn, x.grad_fn)

The output should be: AddBackward None

The gradients store how much the variable affects the loss. You can see this for yourself by running the following along with the code above:

y.sum().backward()
print(x.grad)

Thanks a lot again, got it this time :slight_smile:
Now the only question that still persists is why it doesn't work when setting the conv layer's weights directly using the non-functional form?


That I would have to find out as well :slight_smile:
