Why does PyTorch update the weights of a layer that does not contribute to the output?

Hello,
I have a CNN model with a layer that takes input from its previous layer and computes an output, but this output is not used at all. Hence I expect that backpropagation should not affect its weights.


# ... my model's __init__() part

    self.alpha = alpha_model

def forward(self, x, h, hist):

    alpha = self.alpha(hist)

    # x = torch.pow(x, alpha)
    # h = torch.pow(h, alpha)
    # ...

As you can see, alpha_model is some other CNN that this CNN calls. alpha_model does some computation, but I am no longer using its output; see the commented section, which was the only place where alpha_model was used.

Why, then, when I print the weights of the last layer of alpha_model, are they changing by 0.0001 (which is my learning rate)? Since it does not contribute to the output, the gradients should be 0. Is this because I use the Adam optimizer?
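For context, here is the kind of behaviour I expected, shown on a toy sketch (hypothetical two-layer setup, not my actual model): a parameter whose output never reaches the loss gets no gradient and is skipped by the optimizer.

    import torch
    import torch.nn as nn

    used = nn.Linear(4, 1)
    unused = nn.Linear(4, 1)   # computed below but never reaches the loss

    opt = torch.optim.Adam(list(used.parameters()) + list(unused.parameters()), lr=1e-4)

    x = torch.randn(8, 4)
    _ = unused(x)              # output thrown away, like alpha in my model
    loss = used(x).mean()

    opt.zero_grad()
    loss.backward()
    print(unused.weight.grad)  # None: there is no path from this weight to the loss
    opt.step()                 # Adam skips parameters whose .grad is None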

Here is a sample output from my model; fc4 is the last layer of alpha_model:

fc4 weights Parameter containing:
tensor([[ 0.1044, -0.0683, -0.1153, -0.0387, -0.0276,  0.1335, -0.0946,  0.0927,
          0.0041,  0.1364, -0.0192, -0.0634, -0.0361,  0.0884,  0.1091, -0.0954,
          0.1241,  0.1089, -0.0930,  0.0839,  0.0144,  0.0735,  0.0217,  0.0746,
         -0.0384, -0.0422, -0.0879,  0.0786,  0.0737, -0.0474,  0.1309, -0.0705,
         -0.0487, -0.1311, -0.0782, -0.0974,  0.0303,  0.0652,  0.0628, -0.0315,
         -0.0909, -0.0865, -0.0575, -0.1176,  0.0899, -0.0818,  0.0181, -0.1335,
          0.1153,  0.0833]], requires_grad=True)
--
fc4 grad tensor([[ 1.0000e-06, -1.0000e-06, -1.0000e-06, -1.0000e-06, -1.0000e-06,
          1.0000e-06, -1.0000e-06,  1.0000e-06,  1.0000e-06,  1.0000e-06,
         -1.0000e-06, -1.0000e-06, -1.0000e-06,  1.0000e-06,  1.0000e-06,
         -1.0000e-06,  1.0000e-06,  1.0000e-06, -1.0000e-06,  1.0000e-06,
          1.0000e-06,  1.0000e-06,  1.0000e-06,  1.0000e-06, -1.0000e-06,
         -1.0000e-06, -1.0000e-06,  1.0000e-06,  1.0000e-06, -1.0000e-06,
          1.0000e-06, -1.0000e-06, -1.0000e-06, -1.0000e-06, -1.0000e-06,
         -1.0000e-06,  1.0000e-06,  1.0000e-06,  1.0000e-06, -1.0000e-06,
         -1.0000e-06, -1.0000e-06, -1.0000e-06, -1.0000e-06,  1.0000e-06,
         -1.0000e-06,  1.0000e-06, -1.0000e-06,  1.0000e-06,  1.0000e-06]])
---------------------
fc4 weights Parameter containing:
tensor([[ 0.1043, -0.0682, -0.1152, -0.0386, -0.0275,  0.1334, -0.0945,  0.0926,
          0.0040,  0.1363, -0.0191, -0.0633, -0.0360,  0.0883,  0.1090, -0.0953,
          0.1240,  0.1088, -0.0929,  0.0838,  0.0143,  0.0734,  0.0216,  0.0745,
         -0.0383, -0.0421, -0.0878,  0.0785,  0.0736, -0.0474,  0.1308, -0.0704,
         -0.0486, -0.1310, -0.0781, -0.0973,  0.0302,  0.0652,  0.0627, -0.0314,
         -0.0908, -0.0864, -0.0574, -0.1175,  0.0898, -0.0817,  0.0180, -0.1334,
          0.1152,  0.0832]], requires_grad=True)
--
fc4 grad tensor([[ 2.0000e-06, -2.0000e-06, -2.0000e-06, -2.0000e-06, -2.0000e-06,
          2.0000e-06, -2.0000e-06,  2.0000e-06,  2.0000e-06,  2.0000e-06,
         -2.0000e-06, -2.0000e-06, -2.0000e-06,  2.0000e-06,  2.0000e-06,
         -2.0000e-06,  2.0000e-06,  2.0000e-06, -2.0000e-06,  2.0000e-06,
          2.0000e-06,  2.0000e-06,  2.0000e-06,  2.0000e-06, -2.0000e-06,
         -2.0000e-06, -2.0000e-06,  2.0000e-06,  2.0000e-06, -2.0000e-06,
          2.0000e-06, -2.0000e-06, -2.0000e-06, -2.0000e-06, -2.0000e-06,
         -2.0000e-06,  2.0000e-06,  2.0000e-06,  2.0000e-06, -2.0000e-06,
         -2.0000e-06, -2.0000e-06, -2.0000e-06, -2.0000e-06,  2.0000e-06,
         -2.0000e-06,  2.0000e-06, -2.0000e-06,  2.0000e-06,  2.0000e-06]])

If you want to keep a part of the graph from being updated, freeze its parameters by setting requires_grad = False. The optimizer will then skip them.
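For example, a minimal sketch (assuming the unused sub-module is reachable as model.alpha, as in your snippet):

    # freeze the unused sub-module so it never receives gradients
    for p in model.alpha.parameters():
        p.requires_grad = False

    # optionally, hand only the trainable parameters to the optimizer
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)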

Best

I cannot throw away x and h, as they are used in downstream tasks. In any case, I have commented out the use of alpha,

    # x = torch.pow(x, alpha)
    # h = torch.pow(h, alpha)

so although alpha is still computed, it no longer contributes to anything. Shouldn't the derivative of the loss with respect to any parameter of alpha_model then always be 0?

Hi,

This most likely happens because the optimizer you use has weight decay, so the weights are still updated even when the gradient is 0.
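For example, a tiny sketch of that effect (plain SGD here just because its update rule is easiest to read; Adam takes the same weight_decay argument):

    import torch

    w = torch.nn.Parameter(torch.ones(3))
    opt = torch.optim.SGD([w], lr=0.1, weight_decay=0.01)

    w.grad = torch.zeros_like(w)  # pretend the "real" gradient is exactly zero
    opt.step()

    # weight decay adds 0.01 * w to that zero gradient, so w still shrinks
    print(w)  # ~0.999 instead of 1.0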


Thank you for pointing out this possibility.

So you mean that the Adam optimizer I am using has weight decay (an L2 penalty on the weight tensors) switched on by default, which is what penalises the weights of the sub-module that does not contribute to the output?

But if this is the case, I am simply using

    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

and by default I think Adam has weight_decay set to 0.
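The setting can be checked directly on the optimizer, e.g. (a small sketch, reusing the optimizer above):

    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
    print(optimizer.defaults['weight_decay'])         # 0 by default for Adam
    print(optimizer.param_groups[0]['weight_decay'])  # what is actually applied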

Thank you

In Adam, you also have the exponential moving average terms from the algorithm, which can have a similar effect, no?
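Something like this toy sketch (one hypothetical parameter, default weight_decay=0): once the moving averages are non-zero, Adam keeps moving the weight even if the current gradient is exactly zero.

    import torch

    w = torch.nn.Parameter(torch.zeros(1))
    opt = torch.optim.Adam([w], lr=1e-4)  # default weight_decay=0

    # one step with a non-zero gradient fills the moving averages
    w.grad = torch.ones_like(w)
    opt.step()

    # later steps with a zero gradient still change w, because the
    # first-moment estimate decays only gradually towards zero
    for _ in range(3):
        w.grad = torch.zeros_like(w)
        opt.step()
        print(w.item())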

Ah, sorry, I am not an expert. Anyway, I was just curious after making this observation. For now I have removed the sub-module I am not using; I guess that is the best solution 🙂

Thank you anyway for the suggestions and thoughts.