I am having problems implementing my own version of the hinge loss function.
In particular, it runs, but after the first epoch the computed loss is zero.
I am pretty sure I messed up something related to the dimensions of the function's input data.
Here is the code. I would really appreciate your help, I am kinda losing my mind.
PS: the network is a ResNet and the task is image classification on CIFAR-100.
```python
def forward(self, output, y):  # output: batchsize * n_class
    n_class = y.size(1)
    margin = 1
    # isolate the score for the true class
    y_out = torch.sum(torch.mul(output, y)).cuda()
    output_y = torch.mul(torch.ones(n_class).cuda(), y_out).cuda()
    # create an opposite to the one-hot encoded tensor
    anti_y = torch.ones(n_class).cuda() - y.cuda()
    loss = output.cuda() - output_y.cuda() + margin
    loss = loss.cuda()
    # remove the element of the loss corresponding to the true class
    loss = torch.mul(loss.cuda(), anti_y.cuda()).cuda()
    loss = torch.max(loss.cuda(), torch.zeros(n_class).cuda())
    # squared hinge loss
    loss = torch.pow(loss, 2).cuda()
    loss = torch.sum(loss).cuda()
    loss = loss / n_class
    return loss
```
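To make the dimension worry concrete: `torch.sum` with no `dim` argument reduces over every axis, so with a full batch the "true class score" collapses into a single scalar shared by all samples. A toy sketch with made-up shapes (batch of 4, 3 classes, not the real CIFAR sizes):

```python
import torch

# Toy sketch: made-up batch of 4 samples, 3 classes.
output = torch.randn(4, 3)                            # batchsize * n_class scores
y = torch.zeros(4, 3)
y[torch.arange(4), torch.tensor([0, 2, 1, 0])] = 1.0  # one-hot labels

# torch.sum with no dim argument reduces over EVERY axis:
y_out_all = torch.sum(torch.mul(output, y))       # 0-d scalar: true-class scores
                                                  # summed over the whole batch
y_out_per_sample = torch.sum(output * y, dim=1)   # shape (4,): one score per sample

print(y_out_all.shape)         # torch.Size([])
print(y_out_per_sample.shape)  # torch.Size([4])
```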
Could it be that the first gradient step pushes everything onto the flat side of the hinge, so it just returns 0?
Also, a few questions:

- Why so many .cuda() calls? You should not need them if everything is already on the GPU. If everything is on the CPU, you might not want to shuffle things around that much for perf reasons.
- torch.mul(torch.ones(n_class).cuda(), y_out) is the same as just y_out, since broadcasting expands the scalar for you.
- anti_y = torch.ones(n_class).cuda() - y.cuda() can be done as anti_y = 1 - y.cuda() to make it simpler to read and more efficient.
- torch.zeros(n_class).cuda() can be replaced by torch.zeros(n_class, device="cuda") if you want.
- torch.max(loss.cuda(), torch.zeros(n_class).cuda()) -> torch.threshold(loss, 0, 0)
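Putting those suggestions together, a standalone version could be sketched like this (a sketch only: it keeps the original whole-batch sum unchanged, and uses `torch.clamp(min=0)` as an equivalent of the `torch.threshold` suggestion):

```python
import torch

def squared_hinge(output, y, margin=1.0):
    # Hypothetical cleaned-up version of the forward above: no scattered
    # .cuda() calls, broadcasting instead of torch.ones tricks. The logic
    # (including the whole-batch sum in y_out) is unchanged.
    n_class = y.size(1)
    y_out = torch.sum(output * y)      # note: still reduces over the whole batch
    anti_y = 1 - y                     # opposite of the one-hot tensor
    loss = output - y_out + margin     # broadcasting expands the 0-d y_out
    loss = loss * anti_y               # zero out the true-class entries
    loss = torch.clamp(loss, min=0)    # same effect as torch.threshold(loss, 0, 0)
    loss = torch.pow(loss, 2).sum()    # squared hinge, summed
    return loss / n_class
```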
thank you for answering!
The .cuda() calls are there because it kept giving me errors about something that was on the CPU while it expected CUDA, so I tried adding them everywhere.
torch.mul(torch.ones(n_class).cuda(), y_out) is supposed to be a tensor of size n_class with all values equal to y_out, which is the score for the true class (probably there is a more elegant way to do it; I kinda learned Python by trial and error…)
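As the suggestion above points out, broadcasting already handles that expansion: subtracting a 0-d tensor from a vector applies it elementwise, so the torch.ones multiply is redundant. A tiny sketch:

```python
import torch

scores = torch.tensor([1.0, 3.0, 2.0])
y_out = torch.tensor(3.0)  # 0-d tensor: the true-class score

explicit = scores - torch.mul(torch.ones(3), y_out)  # the original approach
broadcast = scores - y_out                           # same result via broadcasting

print(torch.equal(explicit, broadcast))  # True
```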
do you have any advice on how to avoid the gradient pushing everything on the flat side?
Well, that’s your goal: once the hinge loss is all zeros, you classify every sample perfectly, so your job is done.
the issue is that the performance is really poor
would you implement it differently?
You mean classification? It should be perfect on your training set.
the problem is that it is very bad on the training set: after the first epoch the accuracy is 0.09, then the loss goes to 0 and stays there, and of course training doesn’t go on because it thinks it has reached perfection
In that case I guess your implementation is not correct. Did you check on simple examples that it actually computes the multiclass squared hinge loss?
Also, you can try to compare it to other implementations online (a Google search returned https://github.com/HaotianMXu/Multiclass_LinearSVM_with_SGD, for example).
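One way to do that check (a sketch, not the original code): compute the per-sample multiclass squared hinge by hand, sum over j != y of max(0, s_j - s_y + margin)^2, on a tiny batch, and compare it with what your forward returns on the same inputs:

```python
import torch

def reference_squared_hinge(output, target, margin=1.0):
    """Hand-rolled multiclass squared hinge, one sample at a time, for checking."""
    total = 0.0
    for scores, t in zip(output, target):
        s_true = scores[t].item()
        for j, s_j in enumerate(scores.tolist()):
            if j != t:
                total += max(0.0, s_j - s_true + margin) ** 2
    return total / output.size(0)  # mean over the batch (pick your own reduction)

output = torch.tensor([[0.5, 2.0, 1.0],
                       [1.5, 0.0, 0.2]])
target = torch.tensor([0, 1])  # integer class labels for the check

print(reference_squared_hinge(output, target))  # ≈ 8.095 for this toy batch
```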
I based my implementation on that repository; I will try swapping them and seeing if I can make it work.