To explain my issue, I’ll walk through it step by step.
First, let’s build a simple model:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)
Let’s declare the model, optimizer, and a random input:
model = Model()
optimizer = torch.optim.SGD(model.parameters(), lr=1)
x = torch.rand(10)
Let’s run the model:
out = model(x)
We are not going to use any standard loss function; instead, we will do the following:
prob1 = F.softmax(out, dim=-1)
prob2 = F.softmax(out, dim=-1)   # identical values to prob1
loss = prob1 - prob2.detach()    # zero-valued, but gradients can still flow through prob1
loss = torch.sum(loss)
loss.backward()
The loss is supposed to be zero, and the gradients should flow through one of the softmaxes.
Let’s print the loss and the gradient of fc.weight:
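The prints are nothing special, something like:

print("loss:", loss)
print("gradient:", model.fc.weight.grad)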
loss: tensor(0., grad_fn=<SumBackward0>)
gradient: tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
At least for me, it makes sense: the softmax outputs always sum to 1, so summing prob1 gives a constant, and the gradient of a constant is zero.
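As a quick sanity check (a sketch of mine, using a fresh forward pass so nothing clashes with the code above):

# The softmax outputs sum to the constant 1, so the gradient of their
# sum with respect to the weight is exactly zero.
out_check = model(x)
prob_check = F.softmax(out_check, dim=-1)
grad_check, = torch.autograd.grad(prob_check.sum(), model.fc.weight)
print(grad_check)   # all zeros, matching the printout above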
But let’s make a small change: we’ll add a target, as follows:
optimizer.zero_grad()   # clear the (all-zero) grads from the first backward
out = model(x)          # fresh forward pass; the first backward freed the old graph
target = torch.LongTensor([1])
target_onehot = F.one_hot(target, 2)   # tensor([[0, 1]])
prob1 = F.softmax(out, dim=-1)
prob2 = F.softmax(out, dim=-1)
loss = prob1 - prob2.detach()
loss = torch.sum(loss * target_onehot)   # keep only the target’s component
loss.backward()
Now let’s run and print again:
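I print the same two values as before, plus the one-hot, so we can see what is being multiplied (that extra print is my addition):

print("target_onehot:", target_onehot)   # tensor([[0, 1]])
print("loss:", loss)
print("gradient:", model.fc.weight.grad)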
loss: tensor(0., grad_fn=<SumBackward0>)
gradient: tensor([[-0.1637, -0.1643, -0.2050, -0.2044, -0.1300, -0.0278, -0.2135, -0.0859,
-0.2186, -0.0658],
[ 0.1637, 0.1643, 0.2050, 0.2044, 0.1300, 0.0278, 0.2135, 0.0859,
0.2186, 0.0658]])
The target is multiplied by the loss, which is a vector of zeros, so how can it be that we now have gradients?
And even so… why does the multiplication with target help? It isn’t an important part of the backward graph, is it?
What am I missing?
Thanks a lot!!!