Embeddings become NaN

I have a set of entities and some observed relationships among them, and I want to predict new relationships among the entities. The observed relationships are given as triples of the form:

h, l, t

indicating that the entity with id ‘h’ has relation ‘l’ with entity ‘t’.
I’m implementing the tensor factorization method in this paper (see top of page 3 for the algorithm), in which an embedding is learned for each entity and each relation such that dissims(h, l, t) = ||embedding(h) + embedding(l) - embedding(t)|| is small if entities h and t have relation l, and large otherwise.
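
As a concrete illustration of this dissimilarity, here is a minimal plain-Python sketch. It assumes an L2 norm (the post does not say which norm is used), and the function name `dissim` and the toy 2-d vectors are made up for the example:

```python
import math

def dissim(h, l, t):
    """L2 norm of h + l - t for equal-length embedding vectors (lists of floats).

    Assumption: the norm is L2; the post only writes ||.||.
    """
    return math.sqrt(sum((hi + li - ti) ** 2 for hi, li, ti in zip(h, l, t)))

# Toy 2-d embeddings, invented for illustration only.
h = [1.0, 0.0]
l = [0.0, 1.0]
t_good = [1.0, 1.0]   # h + l lands exactly on t, so the dissimilarity is 0
t_bad = [-2.0, 5.0]   # h + l - t = (3, -4), so the dissimilarity is 5

d_good = dissim(h, l, t_good)
d_bad = dissim(h, l, t_bad)
```

Note that `d_good` is exactly zero here, which is the situation where the norm is not differentiable.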

Our training data contains only positive triples, so during training, each time a batch of positive triples (PTs) is sampled, a negative triple (NT) is created for each PT in the batch by corrupting it, and the following objective function is minimized:

sum_{<PT, NT> in batch} [gamma + dissims(PT) - dissims(NT)]+

where

[x]+ = max(0, x)

and gamma is a margin hyperparameter. This objective function says that the dissimilarity for the PT should be smaller than the dissimilarity for the NT by at least gamma; pairs that already satisfy this contribute zero loss.
After each batch, the entity embeddings are normalized to have a norm of one (line 5 of the algorithm).
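
The objective and the normalization step can be sketched in plain Python as follows. This is only an illustration under assumptions: an L2 norm, hand-picked toy vectors, and helper names (`margin_loss`, `normalize`) that are not from the paper or from my actual implementation:

```python
import math

def dissim(h, l, t):
    """L2 norm of h + l - t (assumed norm; the post only writes ||.||)."""
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h, l, t)))

def margin_loss(pairs, gamma):
    """Sum of [gamma + dissim(PT) - dissim(NT)]+ over (PT, NT) pairs."""
    total = 0.0
    for pt, nt in pairs:
        total += max(0.0, gamma + dissim(*pt) - dissim(*nt))
    return total

def normalize(v):
    """Rescale v to unit L2 norm; assumes v is not the zero vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Toy data, invented for illustration only.
h, l, t = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
t_corrupt = [-2.0, 5.0]                      # corrupted tail for the NT
pairs = [((h, l, t), (h, l, t_corrupt))]     # dissim(PT) = 0, dissim(NT) = 5

loss_small_margin = margin_loss(pairs, gamma=1.0)  # margin already satisfied
loss_big_margin = margin_loss(pairs, gamma=7.0)    # margin violated by 2
unit = normalize([3.0, 4.0])                       # rescaled to norm one
```

With gamma = 1 the pair already satisfies the margin and contributes nothing; with gamma = 7 it is violated by 2, and only then does a gradient flow into the embeddings.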

Here is my implementation. The code runs fine and the training error decreases at each step, but at some point the embedding values suddenly become NaN and everything breaks.

Can anyone help me understand why the embedding values are becoming NaN? Thanks!

The gradient of torch.norm at 0 (in version 0.2 and earlier) is NaN.

It might be that the backward pass of one of the norm ops is generating NaN.

We fixed this in master, where the gradient for norm uses subgradients, and hence grad(norm(0)) = 0. The fix will be part of the next release (or you can try it now by building PyTorch from source: https://github.com/pytorch/pytorch#from-source).
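
To see where the NaN comes from: the gradient of the L2 norm is x / ||x||, which is 0 / 0 when x is exactly the zero vector. The plain-Python sketch below illustrates the math only and is not PyTorch's actual code. It shows the naive gradient, the subgradient convention described above, and an epsilon-stabilized variant that is a common workaround on older versions; the function names and the value of `eps` are assumptions chosen for the example.

```python
import math

def norm_grad_naive(x):
    """Gradient of the L2 norm, x / ||x||; NaN for the zero vector."""
    n = math.sqrt(sum(v * v for v in x))
    # Python raises ZeroDivisionError on 0.0 / 0.0, so the IEEE 754
    # result (NaN) is produced explicitly here.
    return [v / n if n != 0.0 else float("nan") for v in x]

def norm_grad_subgradient(x):
    """Same gradient, but picks the subgradient 0 at x = 0."""
    n = math.sqrt(sum(v * v for v in x))
    if n == 0.0:
        return [0.0 for _ in x]
    return [v / n for v in x]

def norm_grad_eps(x, eps=1e-12):
    """Gradient of sqrt(sum(x^2) + eps); eps is a hypothetical small constant.

    The denominator is never zero, so the result is finite everywhere.
    """
    n = math.sqrt(sum(v * v for v in x) + eps)
    return [v / n for v in x]

zero = [0.0, 0.0]
g_naive = norm_grad_naive(zero)        # all NaN
g_sub = norm_grad_subgradient(zero)    # all zeros
g_eps = norm_grad_eps(zero)            # all zeros, no special case needed
```

Once a single NaN gradient reaches the optimizer step, the affected embedding becomes NaN and spreads through every later batch that touches it, which matches the "fine for a while, then everything breaks" symptom.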