Why does PyTorch update the weights of a layer that does not contribute to the output?

Hello,
I have a CNN model with a layer that takes input from its previous layer and computes an output, but this output is not used at all. Hence I expect that backpropagation should not affect its weights.


# ... my model's __init__() part

    self.alpha = alpha_model

def forward(self, x, h, hist):

    alpha = self.alpha(hist)

    # x = torch.pow(x, alpha)
    # h = torch.pow(h, alpha)
    # ...

As you can see, alpha_model is some other CNN that this CNN calls. alpha_model does some computation, but I am no longer using its output; see the commented section, which was the only place where alpha_model was used.

Why, then, when I print the weights of the last layer of alpha_model, are they changing by 0.0001 (which is my learning rate)? Since it does not contribute to the output, the gradients should be 0. Is this because I use the Adam optimizer?
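For context, here is the kind of behaviour I expected, shown on a toy sketch (hypothetical two-layer setup, not my actual model): a parameter whose output never reaches the loss gets no gradient and is skipped by the optimizer.

    import torch
    import torch.nn as nn

    used = nn.Linear(4, 1)
    unused = nn.Linear(4, 1)   # computed below but never reaches the loss

    opt = torch.optim.Adam(list(used.parameters()) + list(unused.parameters()), lr=1e-4)

    x = torch.randn(8, 4)
    _ = unused(x)              # output thrown away, like alpha in my model
    loss = used(x).mean()

    opt.zero_grad()
    loss.backward()
    print(unused.weight.grad)  # None: there is no path from this weight to the loss
    opt.step()                 # Adam skips parameters whose .grad is None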

Here is a sample output from my model; fc4 is the last layer of alpha_model:

fc4 weights Parameter containing:
tensor([[ 0.1044, -0.0683, -0.1153, -0.0387, -0.0276,  0.1335, -0.0946,  0.0927,
          0.0041,  0.1364, -0.0192, -0.0634, -0.0361,  0.0884,  0.1091, -0.0954,
          0.1241,  0.1089, -0.0930,  0.0839,  0.0144,  0.0735,  0.0217,  0.0746,
         -0.0384, -0.0422, -0.0879,  0.0786,  0.0737, -0.0474,  0.1309, -0.0705,
         -0.0487, -0.1311, -0.0782, -0.0974,  0.0303,  0.0652,  0.0628, -0.0315,
         -0.0909, -0.0865, -0.0575, -0.1176,  0.0899, -0.0818,  0.0181, -0.1335,
          0.1153,  0.0833]], requires_grad=True)
--
fc4 grad tensor([[ 1.0000e-06, -1.0000e-06, -1.0000e-06, -1.0000e-06, -1.0000e-06,
          1.0000e-06, -1.0000e-06,  1.0000e-06,  1.0000e-06,  1.0000e-06,
         -1.0000e-06, -1.0000e-06, -1.0000e-06,  1.0000e-06,  1.0000e-06,
         -1.0000e-06,  1.0000e-06,  1.0000e-06, -1.0000e-06,  1.0000e-06,
          1.0000e-06,  1.0000e-06,  1.0000e-06,  1.0000e-06, -1.0000e-06,
         -1.0000e-06, -1.0000e-06,  1.0000e-06,  1.0000e-06, -1.0000e-06,
          1.0000e-06, -1.0000e-06, -1.0000e-06, -1.0000e-06, -1.0000e-06,
         -1.0000e-06,  1.0000e-06,  1.0000e-06,  1.0000e-06, -1.0000e-06,
         -1.0000e-06, -1.0000e-06, -1.0000e-06, -1.0000e-06,  1.0000e-06,
         -1.0000e-06,  1.0000e-06, -1.0000e-06,  1.0000e-06,  1.0000e-06]])
---------------------
fc4 weights Parameter containing:
tensor([[ 0.1043, -0.0682, -0.1152, -0.0386, -0.0275,  0.1334, -0.0945,  0.0926,
          0.0040,  0.1363, -0.0191, -0.0633, -0.0360,  0.0883,  0.1090, -0.0953,
          0.1240,  0.1088, -0.0929,  0.0838,  0.0143,  0.0734,  0.0216,  0.0745,
         -0.0383, -0.0421, -0.0878,  0.0785,  0.0736, -0.0474,  0.1308, -0.0704,
         -0.0486, -0.1310, -0.0781, -0.0973,  0.0302,  0.0652,  0.0627, -0.0314,
         -0.0908, -0.0864, -0.0574, -0.1175,  0.0898, -0.0817,  0.0180, -0.1334,
          0.1152,  0.0832]], requires_grad=True)
--
fc4 grad tensor([[ 2.0000e-06, -2.0000e-06, -2.0000e-06, -2.0000e-06, -2.0000e-06,
          2.0000e-06, -2.0000e-06,  2.0000e-06,  2.0000e-06,  2.0000e-06,
         -2.0000e-06, -2.0000e-06, -2.0000e-06,  2.0000e-06,  2.0000e-06,
         -2.0000e-06,  2.0000e-06,  2.0000e-06, -2.0000e-06,  2.0000e-06,
          2.0000e-06,  2.0000e-06,  2.0000e-06,  2.0000e-06, -2.0000e-06,
         -2.0000e-06, -2.0000e-06,  2.0000e-06,  2.0000e-06, -2.0000e-06,
          2.0000e-06, -2.0000e-06, -2.0000e-06, -2.0000e-06, -2.0000e-06,
         -2.0000e-06,  2.0000e-06,  2.0000e-06,  2.0000e-06, -2.0000e-06,
         -2.0000e-06, -2.0000e-06, -2.0000e-06, -2.0000e-06,  2.0000e-06,
         -2.0000e-06,  2.0000e-06, -2.0000e-06,  2.0000e-06,  2.0000e-06]])

If you want to keep a part of the graph from being updated, freeze its parameters by setting requires_grad = False. The optimizer will then skip them.
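For example, a minimal sketch (assuming the unused sub-module is reachable as model.alpha, as in your snippet):

    # freeze the unused sub-module so it never receives gradients
    for p in model.alpha.parameters():
        p.requires_grad = False

    # optionally, hand only the trainable parameters to the optimizer
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)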

Best

I cannot throw away x and h, as they are used in downstream tasks. In any case, I have commented out the use of alpha,

    # x = torch.pow(x, alpha)
    # h = torch.pow(h, alpha)

so although alpha is still computed, it no longer contributes to anything. Shouldn't the derivative of the loss with respect to any parameter of alpha_model then always be 0?

Hi,

This most likely happens because the optimizer you use has weight decay, so the weights are still updated even when the gradient is 0.
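For example, a tiny sketch of that effect (plain SGD here just because its update rule is easiest to read; Adam takes the same weight_decay argument):

    import torch

    w = torch.nn.Parameter(torch.ones(3))
    opt = torch.optim.SGD([w], lr=0.1, weight_decay=0.01)

    w.grad = torch.zeros_like(w)  # pretend the "real" gradient is exactly zero
    opt.step()

    # weight decay adds 0.01 * w to that zero gradient, so w still shrinks
    print(w)  # ~0.999 instead of 1.0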


Thank you for pointing out this possibility.

So you mean that the Adam optimizer I am using has weight decay (an L2 penalty on the weight tensors) switched on by default, which is what penalises the weights of the sub-module that does not contribute to the output?

But if this is the case, I am simply using

    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

and by default I think Adam has weight_decay set to 0.
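The setting can be checked directly on the optimizer, e.g. (a small sketch, reusing the optimizer above):

    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
    print(optimizer.defaults['weight_decay'])         # 0 by default for Adam
    print(optimizer.param_groups[0]['weight_decay'])  # what is actually applied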

Thank you

In Adam, you also have the exponential moving average terms from the algorithm, which can have a similar effect, no?
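Something like this toy sketch (one hypothetical parameter, default weight_decay=0): once the moving averages are non-zero, Adam keeps moving the weight even if the current gradient is exactly zero.

    import torch

    w = torch.nn.Parameter(torch.zeros(1))
    opt = torch.optim.Adam([w], lr=1e-4)  # default weight_decay=0

    # one step with a non-zero gradient fills the moving averages
    w.grad = torch.ones_like(w)
    opt.step()

    # later steps with a zero gradient still change w, because the
    # first-moment estimate decays only gradually towards zero
    for _ in range(3):
        w.grad = torch.zeros_like(w)
        opt.step()
        print(w.item())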

Ah, sorry, I am not an expert. Anyway, I was just curious after making this observation. For now I have removed the sub-module I am not using; I guess that is the best solution 🙂

Thank you anyway for the suggestions and thoughts.