I wrote a custom vector similarity loss function as I wanted to experiment with different vector similarity heuristics. This is the class:
    class CosineLoss(torch.nn.Module):
        '''
        Loss calculated on the cosine distance between batches of vectors:
            loss = 1 - label * a.b / (|a|*|b|)
        '''
        def __init__(self):
            super(CosineLoss, self).__init__()

        def cosine_similarity(self, mat1, mat2):
            return mat1.unsqueeze(1).bmm(mat2.unsqueeze(2)).squeeze() / \
                (torch.norm(mat1, 2, 1) * torch.norm(mat2, 2, 1))

        def forward(self, input_tensor, target_tensor, labels):
            sim = self.cosine_similarity(input_tensor, target_tensor)
            loss = (1.0 - labels * sim).sum() / labels.size(0)
            return loss
This has very similar behaviour to nn.CosineEmbeddingLoss: it takes two tensors and a set of labels, and calculates a positive or negative similarity loss depending on the labels’ sign. One difference is that I have not used a margin (equivalent to margin = 0 in nn.CosineEmbeddingLoss). On two batches of vectors enc and dec, the loss calculation is:
    self.error_f = CosineLoss()

    labels = autograd.Variable(torch.ones(batch_size))
    loss = self.error_f(enc, dec, labels) + \
        self.error_f(enc, dec[torch.randperm(batch_size)], -labels)
Here, I use the ground-truth batch as the positive batch and a shuffled batch as the negative batch (to avoid the easy minimum of zero-valued parameters). I am able to train successfully with this loss and it begins to converge, but after some time (30-40 epochs on a small dataset) the loss becomes polluted with NaNs when calculating the negative batch loss (the second term above).
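One possible source of such NaNs, offered as an assumption rather than a confirmed diagnosis: the similarity divides by the product of the two norms with no lower bound, so any row that collapses to all zeros produces 0/0. The sketch below reproduces that case with a hypothetical helper, `unguarded_cosine`, which mirrors the `cosine_similarity` method above, and compares it with `torch.nn.functional.cosine_similarity`, which accepts an `eps` argument. The toy tensors are made up for illustration.

```python
import torch
import torch.nn.functional as F


def unguarded_cosine(a, b):
    # Hypothetical helper mirroring the cosine_similarity method above:
    # row-wise dot product divided by the product of the L2 norms.
    dot = (a * b).sum(dim=1)
    return dot / (torch.norm(a, 2, 1) * torch.norm(b, 2, 1))


# Toy data: the second row of `a` has zero norm.
a = torch.tensor([[1.0, 2.0], [0.0, 0.0]])
b = torch.tensor([[2.0, 1.0], [3.0, 4.0]])

raw = unguarded_cosine(a, b)                        # second entry is 0 / 0
safe = F.cosine_similarity(a, b, dim=1, eps=1e-8)   # norms are bounded below by eps

print(torch.isnan(raw))   # flags the zero-norm row
print(torch.isnan(safe))  # no NaNs expected
```

If this is the cause, checking `torch.isnan` on the norms of `enc` and `dec` just before the loss call should show which batch degenerates first.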
Using the cosine loss from the nn library, I am able to train without NaNs. However, I don’t see anything immediately wrong with my implementation.
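A quick way to check how closely the two losses agree is to run both on the same finite inputs. The sketch below assumes the documented `nn.CosineEmbeddingLoss` formula (1 - cos for label 1, max(0, cos - margin) for label -1) and uses random stand-in tensors in place of the real `enc` and `dec`. Under that formula, the unclamped negative term 1 + cos from the custom class should line up with margin = -1 rather than margin = 0.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, dim = 8, 16
enc = torch.randn(batch_size, dim)  # stand-ins for the real batches
dec = torch.randn(batch_size, dim)
labels = torch.ones(batch_size)

cos = F.cosine_similarity(enc, dec, dim=1)

# Custom loss from the class above, written out per label sign.
custom_pos = (1.0 - cos).mean()   # labels = +1
custom_neg = (1.0 + cos).mean()   # labels = -1

# Library loss on the same inputs.
lib_pos = nn.CosineEmbeddingLoss(margin=0.0)(enc, dec, labels)
lib_neg_m0 = nn.CosineEmbeddingLoss(margin=0.0)(enc, dec, -labels)
lib_neg_m1 = nn.CosineEmbeddingLoss(margin=-1.0)(enc, dec, -labels)

print(custom_pos.item(), lib_pos.item())
print(custom_neg.item(), lib_neg_m0.item(), lib_neg_m1.item())
```

With margin = 0 the library clamps the negative term at zero, so it stops pushing once the similarity is non-positive, whereas the unclamped version keeps pushing towards a similarity of -1.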
Is there some trick I have missed that was used when implementing