PyTorch not updating custom layer variables under certain settings

MUmairDogar · February 20, 2022, 2:36pm

I am new to PyTorch and am trying to port my old work from TF. I am using the following simple custom layer to train a mask for the filters in the network:

import numpy as np
import torch
import torch.nn as nn

class Gates(nn.Module):
    def __init__(self, size):
        super().__init__()
        self.size = size
        f = torch.from_numpy(np.ones(size,np.single))
        self.weight = nn.Parameter(f)
    def forward(self, x):
        return x*torch.reshape(self.weight,(1,self.size,1,1))

When I train this network with the Adam optimizer, the custom layer variables are updated, but when I use the SGD optimizer they are not updated and remain stuck at 1.
This works:

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

This does not work:

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001)

Furthermore, my main problem is that I want to add L1 regularization to the custom layer variables. When I do that with the following code, under either of the settings above, the only loss that updates the custom layer variables is the norm loss, not the criterion loss.

optimizer.zero_grad()
outputs = net(inputs)
reg_loss1 = None
for m in net.modules():
    if isinstance(m, Gates):
        if reg_loss1 is None:
            reg_loss1 = m.weight.abs().sum()
        else:
            reg_loss1 += m.weight.abs().sum()
loss = reg_loss1
loss = criterion(outputs, targets)
loss.backward()

TIA

ptrblck · February 21, 2022, 8:25pm

Your model seems to work fine with SGD as well, but the gradients are quite small:

import numpy as np
import torch
import torch.nn as nn

class Gates(nn.Module):
    def __init__(self, size):
        super().__init__()
        self.size = size
        f = torch.from_numpy(np.ones(size,np.single))
        self.weight = nn.Parameter(f)
    def forward(self, x):
        return x*torch.reshape(self.weight,(1,self.size,1,1))
    
model = Gates(10)
x = torch.randn(1, 1)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

print(model.weight)
# > tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], requires_grad=True)

out = model(x)
loss = criterion(out, torch.randint(0, 2, (1, 1, 1)))
loss.backward()
optimizer.step()

print(model.weight)
# > tensor([1.0000, 0.9998, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
#          1.0000], requires_grad=True)

As you can see, the parameter does get an update, but given your observation you might have expected a larger step size.
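To illustrate why the same small gradient can look like "no update" with SGD but a visible update with Adam, here is a minimal sketch. The helper `one_step_delta` and the gradient value `1e-4` are assumptions made up for this illustration; they are not measured from the model in this thread. The gradient is set by hand so that only the optimizer's update rule is being compared.

```python
import torch

def one_step_delta(optimizer_cls, grad_value, lr=0.001):
    """Return how far a single parameter moves after one optimizer step."""
    # Start at zero so the size of the move is easy to read off.
    w = torch.nn.Parameter(torch.zeros(1))
    opt = optimizer_cls([w], lr=lr)
    # Inject a hand-picked gradient instead of running a real backward pass.
    w.grad = torch.full((1,), grad_value)
    opt.step()
    return w.detach().abs().item()

tiny_grad = 1e-4  # illustrative value, not taken from the thread's model
sgd_delta = one_step_delta(torch.optim.SGD, tiny_grad)
adam_delta = one_step_delta(torch.optim.Adam, tiny_grad)

print(f"SGD  moved by {sgd_delta:.2e}")
print(f"Adam moved by {adam_delta:.2e}")
```

Plain SGD moves the parameter by lr * grad, so a tiny gradient gives a tiny step. Adam divides the gradient by a running estimate of its magnitude, so its first step is close to lr no matter how small the gradient is.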

MUmairDogar · February 21, 2022, 9:10pm

Thanks for the reply. A correction to the fourth code block in my post. It should read:
loss += criterion(outputs, targets)
instead of:
loss = criterion(outputs, targets)

These gates are part of a large CNN, and I need to apply the norm only to these gates. The real problem arises when I use norm loss + criterion loss: only the norm loss acts on the gates, and they start decreasing linearly.

Example:

Parameter containing:
tensor([0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
        0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
        0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
        0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
        0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
        0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
        0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900, 0.9900,
        0.9900], device='cuda:0', requires_grad=True)
9.449404761904763
Parameter containing:
tensor([0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
        0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
        0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
        0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
        0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
        0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
        0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900, 0.7900,
        0.7900], device='cuda:0', requires_grad=True)
10.632621951219512
Parameter containing:
tensor([0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
        0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
        0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
        0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
        0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
        0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
        0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900, 0.5900,
        0.5900], device='cuda:0', requires_grad=True)

From the above pattern it is evident that only the norm loss is updating these gates: they decrease linearly with the same step size. If both losses were acting on these parameters, the output would look something like this:

Parameter containing:
tensor([0.8699, 0.8551, 0.8640, 0.8520, 0.8678, 0.8507, 0.8549, 0.8534, 0.8580,
        0.8608, 0.8743, 0.8603, 0.8631, 0.8679, 0.8597, 0.8569, 0.8374, 0.8583,
        0.8512, 0.8547, 0.8615, 0.8548, 0.8784, 0.8510, 0.8589, 0.8481, 0.8453,
        0.8566, 0.8598, 0.8733, 0.8685, 0.8767, 0.8714, 0.8580, 0.8629, 0.8675,
        0.8576, 0.8471, 0.8630, 0.8706, 0.8667, 0.8785, 0.8623, 0.8619, 0.8527,
        0.8507, 0.8524, 0.8556, 0.8650, 0.8653, 0.8517, 0.8492, 0.8602, 0.8517,
        0.8635, 0.8539, 0.8691, 0.8669, 0.8660, 0.8596, 0.8686, 0.8661, 0.8424,
        0.8724], device='cuda:0', requires_grad=True)

If you need more information please let me know.

ptrblck · February 21, 2022, 10:02pm

I’m not sure that’s the case since (as previously described) the updates from nn.CrossEntropyLoss are tiny in your use case. Increase the learning rate to e.g. 10 and you should be able to see the nn.CrossEntropyLoss updates.
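One way to check this directly is to differentiate each loss separately with respect to the gate weight and compare the magnitudes. The sketch below uses a made-up toy setup (8 channels, a single linear `head`, random inputs), so the numbers only illustrate the idea and say nothing about the actual CNN in this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a gate layer's weight; 8 channels is an arbitrary choice.
weight = nn.Parameter(torch.ones(8))

# Hypothetical tiny "network": gate the channels, pool, then classify.
x = torch.randn(4, 8, 5, 5)
targets = torch.randint(0, 3, (4,))
head = nn.Linear(8, 3)

gated = x * weight.reshape(1, -1, 1, 1)
logits = head(gated.mean(dim=(2, 3)))

ce_loss = nn.functional.cross_entropy(logits, targets)
l1_loss = weight.abs().sum()

# Differentiate each loss separately with respect to the gate weight.
(ce_grad,) = torch.autograd.grad(ce_loss, weight, retain_graph=True)
(l1_grad,) = torch.autograd.grad(l1_loss, weight)

print("cross-entropy grad, max |g|:", ce_grad.abs().max().item())
print("L1 grad:", l1_grad)
```

The gradient of the L1 term is sign(weight), which is exactly 1 for every gate while the gates are positive. That gives the identical, linear decrease in every entry. The cross-entropy gradient differs per channel and can be much smaller, so its contribution is easy to miss at a small learning rate.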

MUmairDogar · February 25, 2022, 12:34pm

Thanks for the reply. Yes, it was a problem of balancing/normalizing the different losses.
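For anyone landing here later, a common way to balance the two terms is to scale the L1 penalty by a coefficient before adding it to the criterion loss. The following is only a sketch under assumptions: the small `net`, the helper `gate_l1`, the value `l1_lambda = 1e-3`, and the learning rate are all invented for illustration and would need tuning. `Gates` is the layer from the first post, written with `torch.ones` instead of NumPy.

```python
import torch
import torch.nn as nn

class Gates(nn.Module):
    """Per-channel gate, as defined earlier in the thread."""
    def __init__(self, size):
        super().__init__()
        self.size = size
        self.weight = nn.Parameter(torch.ones(size))

    def forward(self, x):
        return x * self.weight.reshape(1, self.size, 1, 1)

def gate_l1(model):
    """Sum of |weight| over every Gates module in the model."""
    return sum(m.weight.abs().sum() for m in model.modules() if isinstance(m, Gates))

torch.manual_seed(0)
# Hypothetical small network, only here to make the step runnable.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    Gates(8),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
l1_lambda = 1e-3  # assumed weighting of the L1 term; tune per problem

inputs = torch.randn(16, 3, 32, 32)
targets = torch.randint(0, 10, (16,))

optimizer.zero_grad()
outputs = net(inputs)
# Both terms contribute to the same backward pass.
loss = criterion(outputs, targets) + l1_lambda * gate_l1(net)
loss.backward()
optimizer.step()

print(net[1].weight)
```

With the penalty scaled down, the per-channel cross-entropy gradient is no longer swamped by the constant L1 gradient, so the gates stop moving in lockstep.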