Has anyone gotten NaN values during training with nn.Embedding?
I recently got NaN gradients in an embedding layer. I saw a NaN issue reported for nn.Embedding, but I don't know whether that issue has been resolved.
My use case is this: I use nn.Embedding as the weight of a linear layer, and during the forward pass, a few selected rows of the embedding are used in the computation. I also apply weight_decay in the optimizer.
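To make the setup concrete, here is a minimal sketch of what I mean (the sizes and indices are just illustrative): selected embedding rows serve as the weight of F.linear, and the optimizer applies weight_decay to the whole embedding table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Embedding table whose rows act as linear-layer weights (sizes illustrative).
emb = nn.Embedding(num_embeddings=100, embedding_dim=16)

idx = torch.tensor([3, 7, 42])   # rows selected in this forward pass
w = emb(idx)                     # (3, 16) weight slice
x = torch.randn(8, 16)           # batch of inputs
out = F.linear(x, w)             # (8, 3)

loss = out.pow(2).mean()
loss.backward()

# With a dense embedding gradient, weight_decay shrinks *every* row each
# step, including rows that were never selected, so rarely-used rows can
# decay toward extremely small values over many steps.
opt = torch.optim.SGD(emb.parameters(), lr=0.1, weight_decay=1e-2)
opt.step()
```

Note that only the selected rows get a data gradient, but the decay term touches the whole table every step.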
I investigated where the NaN gradient first appears and found that it is generated inside the embedding layer.
I also saw an issue reporting that once embedding values go to zero, the gradient becomes NaN. I used pdb to inspect the row vector whose gradient is NaN, and found that some of its values are extremely small, around 1e-41 (i.e., denormal float32 values).
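Here is one way values that small can turn into NaN gradients, assuming something like a norm or normalization appears downstream of the embedding (this is just a hypothesis about the mechanism, not my actual model code): squaring 1e-41 underflows to zero in float32, so a computed norm becomes exactly 0 and the backward pass divides by it.

```python
import torch

# A row of denormal float32 values, like the ones I saw in pdb.
v = torch.full((4,), 1e-41, requires_grad=True)

# (1e-41)**2 == 1e-82 underflows to 0 in float32,
# so the sum of squares, and hence the norm, is exactly 0.
n = v.norm()

# Dividing by the zero norm produces inf in the forward pass,
# and the backward pass through 1/n yields non-finite gradients.
out = (v / n).sum()
out.backward()
```

So the NaN may come not from nn.Embedding itself but from the decayed near-zero rows hitting a division or normalization later on.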
Has anyone encountered this kind of situation? How can it be resolved? I will try replacing nn.Embedding with nn.Parameter to avoid the NaN issue, but I would like to understand why it happens in the first place.