Hi everyone,
I am having problems implementing my own version of the hinge loss function.
In particular, it runs, but after the first epoch the computed loss is zero.
I am pretty sure I messed up something related to the dimensions of the input tensors to the function.
Here is the code; I would really appreciate your help, I am kind of losing my mind.
PS: the network is a ResNet and the task is image classification on CIFAR-100.
class MultiClassSquaredHingeLoss(nn.Module):
    def __init__(self):
        super(MultiClassSquaredHingeLoss, self).__init__()

    def forward(self, output, y):  # output: batch_size * n_class
        n_class = y.size(1)
        margin = 1
        # isolate the score for the true class
        y_out = torch.sum(torch.mul(output, y)).cuda()
        output_y = torch.mul(torch.ones(n_class).cuda(), y_out).cuda()
        # create an opposite to the one-hot encoded tensor
        anti_y = torch.ones(n_class).cuda() - y.cuda()
        loss = output.cuda() - output_y.cuda() + margin
        loss = loss.cuda()
        # remove the element of the loss corresponding to the true class
        loss = torch.mul(loss.cuda(), anti_y.cuda()).cuda()
        # max(0, _)
        loss = torch.max(loss.cuda(), torch.zeros(n_class).cuda())
        # squared hinge loss
        loss = torch.pow(loss, 2).cuda()
        # sum up
        loss = torch.sum(loss).cuda()
        loss = loss / n_class
        return loss
Could it be that the first gradient step pushes everything onto the flat side of the hinge, so it just returns 0?
Also, a few questions:
Why so many .cuda() calls? You should not need them if everything is already on the GPU. If everything is on the CPU, you might not want to shuffle things around that much for perf reasons.
torch.mul(torch.ones(n_class).cuda(), y_out) is the same as y_out, right?
anti_y = torch.ones(n_class).cuda() - y.cuda() can be written as anti_y = 1 - y.cuda(), which is simpler to read and more efficient.
torch.zeros(n_class).cuda() can be replaced by torch.zeros(n_class, device="cuda") if you want.
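Putting those last two suggestions together, something like this (a quick sketch; I use a CPU fallback here only so it runs anywhere, the device name is just an example):

```python
import torch

n_class = 5
y = torch.zeros(n_class)
y[2] = 1.0  # one-hot target, true class = 2

# torch.ones(n_class) - y can rely on broadcasting instead:
anti_y = 1 - y  # tensor([1., 1., 0., 1., 1.])

# allocate directly on the target device instead of calling .cuda() afterwards:
device = "cuda" if torch.cuda.is_available() else "cpu"
zeros = torch.zeros(n_class, device=device)
```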
thank you for answering!
The .cuda() calls are there because it kept giving me errors about something that was on the CPU while it expected CUDA, so I tried adding them everywhere.
torch.mul(torch.ones(n_class).cuda(), y_out) is supposed to be a tensor of size n_class with all values equal to y_out, which is the score of the true class (probably there is a more elegant way to do it; I kind of learned Python by trial and error…).
do you have any advice on how to avoid the gradient pushing everything on the flat side?
The problem is that it is very bad on the training set: after the first epoch the accuracy is 0.09, then the loss goes to 0 and stays there, and of course training doesn't progress because it thinks it has reached perfection.
In that case I guess your implementation is not correct. Did you check on simple examples that it actually computes the multi-class squared hinge loss?
Also, you can try to compare it to other implementations online (for example, a Google search returned https://github.com/HaotianMXu/Multiclass_LinearSVM_with_SGD).
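One concrete thing to check: torch.sum(torch.mul(output, y)) with no dim argument sums over the whole batch and all classes, so you get a single scalar instead of one true-class score per sample. A per-sample sketch of the loss could look like this (my own take, not tested against your training setup; I average over the batch, but normalize however suits you):

```python
import torch
import torch.nn as nn

class MultiClassSquaredHingeLoss(nn.Module):
    """Squared hinge loss for one-hot targets y of shape (batch_size, n_class)."""
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, output, y):
        # score of the true class, kept per sample: shape (batch_size, 1)
        true_score = (output * y).sum(dim=1, keepdim=True)
        # margin violations for every class: shape (batch_size, n_class)
        loss = output - true_score + self.margin
        # zero out the entry corresponding to the true class
        loss = loss * (1 - y)
        # max(0, .), then square
        loss = torch.clamp(loss, min=0) ** 2
        # sum over classes, average over the batch
        return loss.sum(dim=1).mean()
```

With dim=1 and keepdim=True the subtraction broadcasts correctly across the batch, so one sample hitting the flat side of the hinge no longer zeroes out the whole loss.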