I wrote a custom vector similarity loss function as I wanted to experiment with different vector similarity heuristics. This is the class:

```
import torch


class CosineLoss(torch.nn.Module):
    '''
    Loss calculated on the cosine distance between batches of vectors:
        loss = 1 - label * a.b / (|a|*|b|)
    '''

    def __init__(self):
        super(CosineLoss, self).__init__()

    def cosine_similarity(self, mat1, mat2):
        # Row-wise dot product via batched matmul, divided by the product of the L2 norms
        return mat1.unsqueeze(1).bmm(mat2.unsqueeze(2)).squeeze() / \
            (torch.norm(mat1, 2, 1) * torch.norm(mat2, 2, 1))

    def forward(self, input_tensor, target_tensor, labels):
        sim = self.cosine_similarity(input_tensor, target_tensor)
        # Mean over the batch of (1 - label * similarity)
        loss = (1.0 - labels * sim).sum() / labels.size(0)
        return loss
```

This has very similar behaviour to `nn.CosineEmbeddingLoss`: it takes two tensors and a set of labels, and calculates a positive or negative similarity loss depending on the labels' sign. One difference is that I have not used a margin: for negative pairs my loss is 1 + cos, whereas `nn.CosineEmbeddingLoss` computes max(0, cos - margin). On two batches of vectors `enc` and `dec`, the loss calculation is:

```
self.error_f = CosineLoss()
labels = autograd.Variable(torch.ones(batch_size))
loss = self.error_f(enc, dec, labels) + \
    self.error_f(enc, dec[torch.randperm(batch_size)], -labels)
```

Here, I use the ground truth batch as the positive batch and a shuffled batch as the negative batch (to avoid the easy minimum of zero-valued parameters). I am able to train successfully with this loss and it begins to converge, but after some time (30-40 epochs on a small dataset) the loss becomes polluted with NaNs when calculating the **negative** batch loss (the second term above).
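To narrow down where the NaNs come from, a check along the following lines might help. It is only a sketch: `cosine_terms` is a hypothetical helper, `enc` and `dec` are random stand-ins rather than my real model outputs, and the zeroed row is an artificial way of provoking a NaN. I have not confirmed that this is what happens during training. It only shows that the unguarded division in `cosine_similarity` gives 0/0 when a row has zero norm.

```python
import torch

torch.manual_seed(0)


def cosine_terms(a, b):
    """Hypothetical helper: per-row dot products and products of L2 norms."""
    dots = (a * b).sum(dim=1)
    norms = a.norm(p=2, dim=1) * b.norm(p=2, dim=1)
    return dots, norms


# Random stand-ins for the encoder/decoder outputs (batch of 4, dimension 3).
enc = torch.randn(4, 3)
dec = torch.randn(4, 3)
dec[2] = 0.0  # assumption for illustration: one row has collapsed to all zeros

dots, norms = cosine_terms(enc, dec)
sim = dots / norms  # same kind of division as in CosineLoss, with no epsilon

print("smallest norm product:", norms.min().item())
print("rows with NaN similarity:", torch.isnan(sim).nonzero().flatten().tolist())
```

Logging the smallest norm product for the positive and negative batches separately each epoch would show whether it drifts towards zero before the NaNs appear.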

Using the cosine loss from the nn library I am able to train without NaNs. However, I don't see anything immediately wrong with my implementation.
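For comparison, the same two-term loss built on the library class could look roughly like this. It is a minimal sketch with random stand-in tensors and assumed sizes (`batch_size = 8`, `dim = 16`), not my actual training code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes and random stand-ins for the real encoder/decoder outputs.
batch_size, dim = 8, 16
enc = torch.randn(batch_size, dim)
dec = torch.randn(batch_size, dim)

# margin=0.0 is the default; written out to mirror the "no margin" setup.
# For target = -1 the library term is max(0, cos - margin).
criterion = nn.CosineEmbeddingLoss(margin=0.0)
labels = torch.ones(batch_size)

# Positive term on the aligned batch, negative term on a shuffled batch.
loss = criterion(enc, dec, labels) + \
    criterion(enc, dec[torch.randperm(batch_size)], -labels)

print(loss.item())
```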

Is there some trick I have missed that was used when implementing `nn.CosineEmbeddingLoss`?