Hi,
while trying to implement the following paper in PyTorch: https://arxiv.org/abs/1705.08039, I ran into some issues with my gradients. I sanity-checked the values and everything seems to be correct, but I still get NaN values for some of the gradients.

Here is the repo implementing the code in a Jupyter Notebook:

Also, since this is the first time I have tried to implement a paper with PyTorch primitives, I would be glad if someone more experienced could review my code and point out best practices and/or optimizations that I could apply.

Thank you for your precious help and guidance for the code review!

θ/∥θ∥ − ε is what it says in the paper: EPS needs to be outside the denominator. With θ/(∥θ∥ − ε) the projected vector ends up with a norm slightly above 1, so you're increasing the big embeddings instead of decreasing them.
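To make the difference concrete, here is a minimal sketch of both variants. The notebook code isn't shown in this thread, so the names project and project_wrong and the value of EPS are assumptions for illustration. I'm also reading "θ/∥θ∥ − ε" as "rescale so the norm becomes 1 − ε", which is one common interpretation rather than necessarily what the reference implementation does.

```python
import torch

EPS = 1e-5  # assumed value; the thread does not say which constant is used


def project(theta: torch.Tensor, eps: float = EPS) -> torch.Tensor:
    """Pull rows with norm >= 1 back inside the unit ball; leave the others alone."""
    norm = theta.norm(dim=-1, keepdim=True)
    # eps is applied after the normalisation, so the new norm is 1 - eps.
    # clamp only guards the division for all-zero rows, which are never selected.
    shrunk = theta / norm.clamp(min=eps) * (1 - eps)
    return torch.where(norm >= 1, shrunk, theta)


def project_wrong(theta: torch.Tensor, eps: float = EPS) -> torch.Tensor:
    """EPS inside the denominator: the result has norm slightly above 1."""
    norm = theta.norm(dim=-1, keepdim=True)
    return torch.where(norm >= 1, theta / (norm - eps), theta)
```

Calling both on a row such as [3.0, 4.0] and printing the norm of the result should show the first one landing just inside the ball and the second one just outside it.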

Also, instead of:

gamma = gamma.clamp(min=1)

do:

gamma = gamma.clamp(min=1+EPS)

This enforces a small nonzero distance even between identical embeddings (the same word). It seems counter-intuitive, but it avoids a zero divisor in the derivative: the partial derivative of the Poincaré distance in the paper has √(γ² − 1) in its denominator, which is zero when γ = 1.
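If you want to check this in isolation, here is a small sketch that takes the gradient of the distance between two identical points with each clamp value. The distance formula is the one from the paper; the function names, the sample point and EPS = 1e-5 are assumptions, not taken from the notebook.

```python
import torch

EPS = 1e-5  # assumed value


def poincare_distance(u, v, min_gamma):
    """arcosh(gamma) with gamma = 1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))."""
    sq_dist = (u - v).pow(2).sum(dim=-1)
    alpha = 1 - u.pow(2).sum(dim=-1)
    beta = 1 - v.pow(2).sum(dim=-1)
    gamma = 1 + 2 * sq_dist / (alpha * beta)
    gamma = gamma.clamp(min=min_gamma)
    return torch.acosh(gamma)


def grad_for_identical_points(min_gamma):
    """Gradient w.r.t. u when u == v, i.e. when gamma is exactly 1 before clamping."""
    u = torch.tensor([0.1, 0.2], requires_grad=True)
    v = torch.tensor([0.1, 0.2])
    poincare_distance(u, v, min_gamma).backward()
    return u.grad


print(grad_for_identical_points(1.0))      # clamp(min=1)
print(grad_for_identical_points(1 + EPS))  # clamp(min=1+EPS)
```

With min=1 the derivative of arcosh is infinite at γ = 1 and gets multiplied by a zero further down the chain rule, which is where the NaN comes from. With min=1+EPS the clamp is active, so the gradient through it is zero and stays finite.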

Amazing, I corrected it as you mentioned and pushed the new version to GitHub. We might need to work on the speed and complexity of the implementation, since it is very slow compared to the C++ implementation.

Thank you for your help. If you have any input on how to make it faster, let me know.