Custom loss function produces NaNs after some time training

I wrote a custom vector similarity loss function as I wanted to experiment with different vector similarity heuristics. This is the class:

import torch


class CosineLoss(torch.nn.Module):
    '''
    Loss calculated on the cosine distance between batches of vectors:
        loss = 1 - label * a.b / (|a|*|b|)
    '''

    def __init__(self):
        super(CosineLoss, self).__init__()

    def cosine_similarity(self, mat1, mat2):
        # Batched dot product via bmm, divided by the product of the L2 norms
        return mat1.unsqueeze(1).bmm(mat2.unsqueeze(2)).squeeze() / \
            (torch.norm(mat1, 2, 1) * torch.norm(mat2, 2, 1))

    def forward(self, input_tensor, target_tensor, labels):
        # Mean over the batch of 1 - label * cosine_similarity
        sim = self.cosine_similarity(input_tensor, target_tensor)
        loss = (1.0 - labels * sim).sum() / labels.size(0)
        return loss

This has very similar behaviour to nn.CosineEmbeddingLoss: it takes two tensors and a set of labels, and calculates a positive or negative similarity loss depending on the sign of the labels. One difference is that I have not used a margin (equivalent to margin=0 in nn.CosineEmbeddingLoss). On two batches of vectors enc and dec, the loss calculation is:

self.error_f = CosineLoss()
labels = autograd.Variable(torch.ones(batch_size))
loss = self.error_f(enc, dec, labels) + \
    self.error_f(enc, dec[torch.randperm(batch_size)], -labels)

Here, I use the ground-truth batch as the positive batch and a shuffled batch as the negative batch (to avoid the easy minimum of zero-valued parameters). I am able to train successfully with this loss and begin to converge, but after some time (30–40 epochs on a small dataset) the loss turns to NaN when calculating the negative-batch loss (the second term above).

Using the cosine loss from the nn library I am able to train without NaNs. However, I don't see anything immediately wrong with my implementation.
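For reference, the nn-library version of the same computation looks roughly like this (a sketch only, assuming the same enc, dec and batch_size as above):

import torch
from torch import autograd

# Built-in equivalent with no margin; forward takes (input1, input2, target)
error_f = torch.nn.CosineEmbeddingLoss(margin=0.0)
labels = autograd.Variable(torch.ones(batch_size))
loss = error_f(enc, dec, labels) + \
    error_f(enc, dec[torch.randperm(batch_size)], -labels)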

Is there some trick I have missed that was used when implementing nn.CosineEmbeddingLoss?

Does adding an epsilon in your cosine_similarity function when you divide by the norms help? These norms can go to 0 during training, which would result in NaN values.
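For example, something along these lines (a sketch; eps is an arbitrary small constant):

def cosine_similarity(self, mat1, mat2, eps=1e-8):
    # Add a small epsilon to the denominator so the division stays
    # finite even if one of the norms collapses to 0 during training.
    return mat1.unsqueeze(1).bmm(mat2.unsqueeze(2)).squeeze() / \
        (torch.norm(mat1, 2, 1) * torch.norm(mat2, 2, 1) + eps)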


@albanD, adding an epsilon to the norms worked like a charm.

Thanks for the tip, great help!