Thanks for the reply. A correction in code cell 4: it should be
loss += criterion(outputs, targets)
instead of
loss = criterion(outputs, targets)
These gates are part of a larger CNN, and I need to apply the norm penalty only to these gates. The real problem arises when I combine the norm loss with the criterion loss: only the norm loss acts on the gates, and they start decreasing linearly.
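For reference, here is a minimal, self-contained sketch of the kind of combined objective I mean. The names (`gates`, `weights`, `lam`) and shapes are assumptions for illustration, not the actual model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: "gates" plays the role of the gate parameters
# inside the larger CNN; "weights" stands in for the rest of the network.
gates = nn.Parameter(torch.full((64,), 0.99))
weights = nn.Parameter(torch.randn(64, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD([gates, weights], lr=0.1)

inputs = torch.randn(8, 64)
targets = torch.randint(0, 10, (8,))

# Combined objective: task loss plus an L1 penalty on the gates only.
# lam is an assumed regularization strength.
lam = 0.01
outputs = (inputs * gates) @ weights
loss = criterion(outputs, targets) + lam * gates.abs().sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# When both terms reach the gates, gates.grad varies per element
# (criterion gradient differs per gate, L1 part is a constant lam here).
print(gates.grad[:5])
```

In this setup the gates receive gradient from both terms, so after one step they are no longer all equal, unlike the constant-step pattern I am seeing.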
Example:
Parameter containing:
tensor([0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
0.9900], device='cuda:0', requires_grad=True)
9.449404761904763
Parameter containing:
tensor([0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
0.7900], device='cuda:0', requires_grad=True)
10.632621951219512
Parameter containing:
tensor([0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
0.5900], device='cuda:0', requires_grad=True)
From the pattern above it is evident that only the norm loss is updating these gates: they decrease linearly with the same step size (0.2 per print). If both losses were acting on these parameters, the output would look something like this:
Parameter containing:
tensor([0.8699, 0.8551, 0.8640, 0.8520, 0.8678, 0.8507, 0.8549, 0.8534, 0.8580,
0.8608, 0.8743, 0.8603, 0.8631, 0.8679, 0.8597, 0.8569, 0.8374, 0.8583,
0.8512, 0.8547, 0.8615, 0.8548, 0.8784, 0.8510, 0.8589, 0.8481, 0.8453,
0.8566, 0.8598, 0.8733, 0.8685, 0.8767, 0.8714, 0.8580, 0.8629, 0.8675,
0.8576, 0.8471, 0.8630, 0.8706, 0.8667, 0.8785, 0.8623, 0.8619, 0.8527,
0.8507, 0.8524, 0.8556, 0.8650, 0.8653, 0.8517, 0.8492, 0.8602, 0.8517,
0.8635, 0.8539, 0.8691, 0.8669, 0.8660, 0.8596, 0.8686, 0.8661, 0.8424,
0.8724], device='cuda:0', requires_grad=True)
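One way to check whether the criterion loss reaches the gates at all is to backprop the criterion term alone and inspect the gate gradient. A sketch under the same assumed names as before (not the actual model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical gate parameter and stand-in for the rest of the CNN.
gates = nn.Parameter(torch.full((64,), 0.99))
weights = nn.Parameter(torch.randn(64, 10))
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(8, 64)
targets = torch.randint(0, 10, (8,))

outputs = (inputs * gates) @ weights
task_loss = criterion(outputs, targets)  # criterion term only, no norm
task_loss.backward()

# A None or all-zero gradient here would mean the gates are effectively
# detached from the criterion loss (e.g. a non-differentiable op such as
# rounding or thresholding between the gates and the output broke the graph).
print(gates.grad.abs().max())
```

If this gradient comes out zero (or `gates.grad` is `None`) in the real model, the constant 0.2 steps would be explained: only the norm term is differentiable with respect to the gates.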
If you need more information, please let me know.