How to add a loss term from hidden layers?

Say I want to add a loss term computed from the output of a hidden layer, so that gradients from this term are propagated as well. For example, we take the output (activation) of a hidden layer and pass it through a square function, i.e. (hidden_output)^2. How can I implement this in PyTorch?
Thanks in advance!

I’m not sure if I understand the use case correctly, but you could use any output of a layer and add it to the loss before calling backward.

Hi! I’m a new PyTorch user and have only used the predefined functions so far. That is, I compute the outputs of a neural net, compare them with the labels, and pass both to the criterion nn.CrossEntropyLoss, as in the example for training a simple classifier. So how do we add the outputs of a layer itself to the loss?

Here is a small dummy example that uses an intermediate activation in the loss:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x1 = F.relu(self.fc1(x))  # hidden activation
        x = self.fc2(x1)
        return x, x1  # return the hidden activation as well


x = torch.randn(10, 10)
y = torch.randn(10, 10)
model = MyModel()
optimizer = optim.SGD(model.parameters(), lr=1e-0)
criterion = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    output, aux = model(x)
    loss = criterion(output, y)
    # add the auxiliary term before calling backward
    loss = loss + (aux**2).mean()
    loss.backward()
    optimizer.step()

    print('Epoch {}, loss {}, aux norm {}'.format(
        epoch, loss.item(), aux.norm()))
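
If you cannot change forward to return the activation, a forward hook is another way to get at it. This is only a rough sketch under my own assumptions: the nn.Sequential model, the `captured` dict, and the `save_activation` function are made up for illustration, and lr=0.1 is an arbitrary choice. The only PyTorch-specific piece beyond the example above is `register_forward_hook`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model whose forward() only returns the final output.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

captured = {}  # illustrative container for the hidden activation


def save_activation(module, inputs, output):
    # Keep the tensor attached to the graph (no detach), so the
    # auxiliary loss can backpropagate through it.
    captured["hidden"] = output


# Hook the ReLU so we capture the activation after the first layer.
handle = model[1].register_forward_hook(save_activation)

x = torch.randn(10, 10)
y = torch.randn(10, 10)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

optimizer.zero_grad()
output = model(x)  # the hook fills captured["hidden"] during this call
loss = criterion(output, y) + (captured["hidden"] ** 2).mean()
loss.backward()
optimizer.step()

handle.remove()  # remove the hook once it is no longer needed
```

The hook has to store the activation without detaching it, otherwise the extra term would not contribute any gradients.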

Would this work as a starter for your use case or are you dealing with another problem?

Thank you! Very cool trick, I was not aware of it. I think this will do. If not, I’ll ask again.

I believe the aux loss should not affect the weights of fc2 and should only affect fc1. However, when I compare the weights of two networks, one trained with the aux loss and one without, the fc2 weights are different. I don’t fully understand why that happens.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x1 = F.relu(self.fc1(x))
        x = self.fc2(x1)
        return x, x1


x = torch.randn(10, 10)
y = torch.randn(10, 10)
model = MyModel()
model2 = MyModel()
model2.load_state_dict(model.state_dict())
optimizer = torch.optim.SGD(model.parameters(), lr=1e-0)
optimizer2 = torch.optim.SGD(model.parameters(), lr=1e-0)
criterion = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    output, aux = model(x)
    loss = criterion(output, y)
    loss = loss + (aux**2).mean()
    loss.backward()
    optimizer.step()

    optimizer2.zero_grad()
    output2, _ = model2(x)
    loss2 = criterion(output2, y)
    loss2.backward()
    optimizer2.step()

print((model2.fc2.weight.detach() == model.fc2.weight.detach()).all())

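In your code you are passing model.parameters() to both optimizers, so model2 never gets updated.
After fixing this, you would expect .fc2.weight to be the same in both models right after loading the state_dict and after the first weight update, since the aux loss contributes no gradient to fc2 and both models still produce the same activations in the first iteration.
However, model.fc1 is updated differently than model2.fc1 in that first step, because it also receives the gradient of the aux loss. From the second iteration on, fc2 therefore sees different inputs, so the output and thus the loss differ, and you cannot expect the remaining iterations of the 100 to yield the same .fc2.weight (otherwise there would be no need to use the aux output/loss).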
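
To check this yourself, here is a minimal sketch of the corrected comparison. A few things are my own choices and not from your code: the seed, the smaller learning rate of 0.1, running only 5 steps, and the `same` helper and `results` dict, which are just there for illustration. It records after each step whether fc1 and fc2 still match between the two models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x1 = F.relu(self.fc1(x))
        return self.fc2(x1), x1


torch.manual_seed(0)  # assumption: fixed seed for reproducibility
x = torch.randn(10, 10)
y = torch.randn(10, 10)

model = MyModel()
model2 = MyModel()
model2.load_state_dict(model.state_dict())

# Each optimizer now gets the parameters of its own model.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)
criterion = nn.MSELoss()


def same(a, b):
    # Illustrative helper: compare two parameters with a small tolerance.
    return torch.allclose(a.detach(), b.detach(), rtol=0.0, atol=1e-7)


results = {}  # step -> (fc1 weights match, fc2 weights match)
for step in range(1, 6):
    # Model trained with the auxiliary loss.
    optimizer.zero_grad()
    output, aux = model(x)
    loss = criterion(output, y) + (aux ** 2).mean()
    loss.backward()
    optimizer.step()

    # Model trained with the main loss only.
    optimizer2.zero_grad()
    output2, _ = model2(x)
    loss2 = criterion(output2, y)
    loss2.backward()
    optimizer2.step()

    results[step] = (
        same(model.fc1.weight, model2.fc1.weight),
        same(model.fc2.weight, model2.fc2.weight),
    )
    print("step", step, "fc1 match:", results[step][0],
          "fc2 match:", results[step][1])
```

One would expect fc2 to still match after the first step while fc1 already differs, and fc2 to drift apart in the following steps.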

Also, you can post code snippets by wrapping them into three backticks ```, which makes debugging easier. :wink:
