Embedding returns NaN issue

Hi, I created a model with 2 embeddings and an lstm module like below.
My input is of the shape (batch_size, window_size, feature_len), so a 3-dimension matrix.

import torch
import torch.nn as nn
from torch import autograd

class model(nn.Module):
    def __init__(self, len1, len2):
        super(model, self).__init__()
        self.embed1 = nn.Embedding(len1, 10)
        self.embed2 = nn.Embedding(len2, 10)

    def forward(self, vec_seq):
        # vec_seq.size() --> (batch_size, window_size, feature_len)
        # vec_seq.type() --> torch.cuda.FloatTensor
        batch_size, window_size, _ = vec_seq.size()
        # features 0 and 1 are integer category ids for the two embeddings
        var1 = autograd.Variable(vec_seq[:, :, 0].type(torch.cuda.LongTensor))
        var2 = autograd.Variable(vec_seq[:, :, 1].type(torch.cuda.LongTensor))
        var1_emb = self.embed1(var1)
        var2_emb = self.embed2(var2)
        # concatenate the embeddings with the remaining float features
        var_embed = torch.cat((var1_emb, var2_emb,
                               autograd.Variable(vec_seq[:, :, 2].contiguous().view(batch_size, window_size, -1)),
                               autograd.Variable(vec_seq[:, :, 3].contiguous().view(batch_size, window_size, -1))), 2)
        # following some LSTM operations.

However, when I debugged my program, I found that all the values of var1_emb and var2_emb are NaN, which is quite weird. In some cases they are not all NaN; instead, part of the embedding is NaN and the rest are ordinary floats (screenshots of the input variable and of the corresponding embedding omitted here).

I used DataLoader to load the data with the following statement:

train_dataloader = DataLoader(train_data, batch_size=opt.batch_size, shuffle=True, num_workers=opt.num_workers)
where batch_size=4 and num_workers=4.
What’s leading to this error?

Thanks.

Hard to say. It would be better if you could provide a minimal script so that I can reproduce your problem directly on my laptop.

For now, my guess is that something is wrong with your LSTM module.
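You could also sanity-check the inputs right before the embedding lookup; a quick sketch, reusing your vec_seq, len1 and len2 from above:

    import numpy as np

    # make sure the batch contains no NaNs and the ids fit both tables
    assert not np.isnan(vec_seq.cpu().numpy()).any()
    idx1, idx2 = vec_seq[:, :, 0], vec_seq[:, :, 1]
    assert idx1.min() >= 0 and idx1.max() < len1   # valid ids for embed1
    assert idx2.min() >= 0 and idx2.max() < len2   # valid ids for embed2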

Dear Yun,

Thanks for your help!

I finally found out where the error came from. When I normalized my data, some values were divided by zero and became np.nan, and as more and more data was fed to the model, the NaN values propagated through the network.
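For anyone hitting the same issue, the fix is to guard the denominator during normalization. Roughly like this (a sketch; the eps value is arbitrary):

    import numpy as np

    def normalize(x, eps=1e-8):
        # guard against zero std so constant features give 0 instead of NaN
        mean = x.mean(axis=0)
        std = x.std(axis=0)
        return (x - mean) / np.maximum(std, eps)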

Thanks very much for your help!


So the problem was not with the Embedding layer? Also, I cannot find the normalization part in the code you attached.

This answer didn’t help me, but I did find the problem I had.

My dataset was very small compared to the dimensions used in the tutorial I was following, and my embeddings were way too big for the data available, so eventually the NaN propagated through the network. Making my embedding layers smaller (a smaller number of factors / columns in the matrix) solved the NaN problem for me.
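Roughly what I mean (a sketch; the sizes here are made up, not from my actual data):

    import torch.nn as nn

    # hypothetical sizes: shrink the embedding dimension relative to the
    # number of categories instead of copying the tutorial's large values
    vocab_size = 50
    embed_dim = 5        # was much larger in the tutorial I followed
    embed = nn.Embedding(vocab_size, embed_dim)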