Problems implementing my own loss

Hi everyone,
I am having problems implementing my own version of the hinge loss function.
In particular, it runs, but after the first epoch the computed loss is zero.
I am pretty sure I messed up something related to the dimensions of the function's input data.
Here is the code; I would really appreciate your help, I am kinda losing my mind.
PS: the network is a ResNet and the task is image classification on CIFAR-100.

import torch
import torch.nn as nn

class MultiClassSquaredHingeLoss(nn.Module):
    def __init__(self):
        super(MultiClassSquaredHingeLoss, self).__init__()

    def forward(self, output, y): #output: batchsize*n_class
        n_class = y.size(1)
        margin = 1
        #isolate the score for the true class
        y_out = torch.sum(torch.mul(output, y)).cuda()
        output_y = torch.mul(torch.ones(n_class).cuda(), y_out).cuda()
        #create an opposite to the one hot encoded tensor
        anti_y = torch.ones(n_class).cuda() - y.cuda()
        
        loss = output.cuda() - output_y.cuda() + margin
        loss = loss.cuda()
        #remove the element of the loss corresponding to the true class
        loss = torch.mul(loss.cuda(), anti_y.cuda()).cuda()
        #max(0,_)
        loss = torch.max(loss.cuda(), torch.zeros(n_class).cuda())
        #squared hinge loss
        loss = torch.pow(loss, 2).cuda()
        #sum up
        loss = torch.sum(loss).cuda()
        loss = loss / n_class        
        
        return loss

Hi,

Could it be that the first gradient step pushes everything on the flat side of the hinge and so it just returns 0?

Also, a few questions:

  • Why so many .cuda() calls? You should not need them if everything is already on the GPU. If everything is on the CPU, you might not want to shuffle things around that much for perf reasons.
  • torch.mul(torch.ones(n_class).cuda(), y_out) is the same as y_out right?
  • anti_y = torch.ones(n_class).cuda() - y.cuda() can be done as anti_y = 1 - y.cuda() to make it simpler to read and more efficient
  • torch.zeros(n_class).cuda() can be replaced by torch.zeros(n_class, device="cuda") if you want.
  • torch.max(loss.cuda(), torch.zeros(n_class).cuda()) can be replaced by torch.threshold(loss, 0, 0)
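
Putting those together, your forward could be tidied up to something like this (just a sketch of the same computation with the redundant ops removed, assuming output and y are already CUDA tensors):

def forward(self, output, y):  # output: batch_size x n_class, y: one-hot
    margin = 1
    # score of the true class (still summed over the whole batch, as in your version)
    y_out = torch.sum(output * y)
    # margin term for every class, with the true-class entries masked out
    loss = (output - y_out + margin) * (1 - y)
    # max(0, _), squared, summed and divided by n_class
    return torch.threshold(loss, 0, 0).pow(2).sum() / y.size(1)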

Thank you for answering!
The cuda calls are there because it kept giving me errors about something that was on the CPU while it expected CUDA, so I tried adding them everywhere.

torch.mul(torch.ones(n_class).cuda(), y_out) is supposed to be a tensor of size n_class with all values equal to y_out, which is the score for the true class (probably there is a more elegant way to do it, I kinda learned Python by trial and error…)

Do you have any advice on how to avoid the gradient pushing everything onto the flat side?

Well, that’s your goal. Once the hinge is all 0s, then you perfectly classify all your samples. So your job is done :smiley:

The issue is that the performance is really poor.

Would you implement it differently?

You mean classification? It should be perfect on your training set.

The problem is that it is very bad on the training set: after the first epoch the accuracy is 0.09, then the loss drops to 0 and stays there, so of course training stops making progress because it thinks it has reached perfection.

In that case I guess your implementation is not correct. Did you check on simple examples that it actually computes the multi-class squared hinge loss?
Also, you can try comparing it to other implementations online (a Google search returned https://github.com/HaotianMXu/Multiclass_LinearSVM_with_SGD, for example).
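
For reference, a per-sample version of the multi-class squared hinge (Weston-Watkins style) typically looks like the sketch below. The class name, the margin default, and the mean reduction over the batch are my own choices, so treat it as something to compare against rather than a drop-in replacement:

import torch
import torch.nn as nn

class SquaredHingeLossReference(nn.Module):
    # reference sketch: output is (batch_size, n_class), y is one-hot of the same shape
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, output, y):
        # true-class score for each sample, kept as (batch_size, 1) so it broadcasts per row
        true_score = (output * y).sum(dim=1, keepdim=True)
        # margin violations for every class of every sample
        loss = torch.clamp(output - true_score + self.margin, min=0)
        # drop the entries of the true classes themselves
        loss = loss * (1 - y)
        # squared hinge, summed over classes and averaged over the batch
        return loss.pow(2).sum(dim=1).mean()

Running a couple of tiny hand-built 2- or 3-class batches through both this and your loss should make any difference show up quickly.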

I based my implementation on that repository; I will try swapping them and see if I can make it work.