I am implementing a recommendation algorithm from this paper that jointly optimizes a tensor-decomposition loss and the Bayesian Personalized Ranking (BPR) loss, using PyTorch.
However, whenever I use SGD as the optimizer, the loss tends to diverge (for reasons I cannot identify), and the embeddings quickly blow up to NaN. Adam does not diverge as easily, but its results do not seem to match those reported in the paper.
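In case it matters, the BPR part of my loss is equivalent to the following sketch (simplified; the function and variable names here are placeholders, not my actual code). I use `F.logsigmoid` because I read that the naive `torch.log(torch.sigmoid(x))` can underflow to `-inf` and produce NaN gradients:

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb, pos_emb, neg_emb, reg=0.01):
    """Pairwise BPR loss: -log sigmoid(s_pos - s_neg), plus L2 regularization."""
    pos_scores = (user_emb * pos_emb).sum(dim=-1)  # scores of observed items
    neg_scores = (user_emb * neg_emb).sum(dim=-1)  # scores of sampled negatives
    # logsigmoid is numerically stable; log(sigmoid(x)) can underflow to -inf
    loss = -F.logsigmoid(pos_scores - neg_scores).mean()
    l2 = reg * (user_emb.pow(2).sum()
                + pos_emb.pow(2).sum()
                + neg_emb.pow(2).sum()) / user_emb.shape[0]
    return loss + l2
```

Is this the right formulation, or could the regularization term itself be a source of the divergence?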
I tried lowering the learning rate, but then training became much slower.
I suspect there may be a bug in my code (link). For instance:
- Is the data loader for personalized ranking in the execution file (in the link above) constructed correctly?
- Is there anything wrong in the training-algorithm implementation (see `fit` in the class file in the `model` folder of the link above)?
- Besides the learning rate, could this kind of divergence depend on the data size, the batch size, or the optimizer settings?
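On the last point: I am also considering adding gradient clipping to the training step as a safeguard against exploding gradients under SGD. A minimal sketch of what I have in mind (assuming the model's forward pass returns the joint loss; names are illustrative, not from my repo):

```python
import torch

def train_step(model, optimizer, batch, max_norm=1.0):
    """One optimizer step with global-norm gradient clipping."""
    optimizer.zero_grad()
    loss = model(batch)  # assume forward() returns the joint (decomposition + BPR) loss
    loss.backward()
    # Clip the global gradient norm before updating, to limit exploding steps
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```

Would clipping like this be a reasonable fix here, or would it just mask an underlying bug?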
Any suggestion is appreciated. Thanks!