LSTM layer outputting NaN on second batch

I have an email dataset that I have to classify as spam or not spam. I am using gensim.models.Word2Vec to create word embeddings for the train and test emails. For preprocessing the dataset I use the tokens

MAX_SENTENCE_TOKENS = 100
EMBEDDING_SIZE = 300
START_TOKEN = '<START>'
END_TOKEN = '<END>'
PADDING_TOKEN = '<PADDING>'

to create sentences that are at most 100 tokens long. After this, I train a skip-gram Word2Vec model and replace the actual words with their embedding indices.
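
Roughly, the conversion from words to indices looks like this (simplified; encode is just an illustrative helper, and it assumes gensim 4.x and that the special tokens were included in the Word2Vec training data, so they have entries in w2v.wv.key_to_index):

def encode(tokens):
    # truncate, add start/end markers, and pad to a fixed length of 100
    tokens = [START_TOKEN] + tokens[:MAX_SENTENCE_TOKENS - 2] + [END_TOKEN]
    tokens += [PADDING_TOKEN] * (MAX_SENTENCE_TOKENS - len(tokens))
    # map every token to its index in the Word2Vec vocabulary
    return [w2v.wv.key_to_index[t] for t in tokens]

For the dataset I have the following torch.utils.data.Dataset class defined: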

class Data(Dataset):
    def __init__(self, x, y):
        # x: token-index sequences, y: spam labels
        self.X = torch.Tensor(x, device=device).long()
        self.y = torch.Tensor(y, device=device).long()
        self.len = self.X.shape[0]

    def __getitem__(self, index):
        return self.X[index], self.y[index]

    def __len__(self):
        return self.len
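
This gets wrapped in a DataLoader along these lines (train_x / train_y here stand for the index and label arrays produced by the preprocessing above; BATCH_SIZE is defined below):

from torch.utils.data import DataLoader

train_data = Data(train_x, train_y)
train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)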

Moreover, my hyperparameters and model are defined as:

BATCH_SIZE = 8
EPOCHS = 30
HIDDEN_SIZE = 64
NUM_LAYERS = 16
LEARNING_RATE = .0001
BIDIRECTIONAL = False

loss_fn = nn.BCELoss()
optim = Adam(model.parameters(), lr=LEARNING_RATE)

class LSTMSpamClassifier(nn.Module):

    def __init__(self, hidden_dim, num_layers):
        super(LSTMSpamClassifier, self).__init__()
        self.emb = Embedding.from_pretrained(torch.Tensor(w2v.wv.vectors))
        self.lstm1 = LSTM(input_size=EMBEDDING_SIZE, hidden_size=HIDDEN_SIZE, num_layers=NUM_LAYERS, bidirectional=BIDIRECTIONAL)
        self.lstm2 = LSTM(input_size=HIDDEN_SIZE, hidden_size=HIDDEN_SIZE, num_layers=NUM_LAYERS, bidirectional=BIDIRECTIONAL)
        self.lstm3 = LSTM(input_size=HIDDEN_SIZE, hidden_size=HIDDEN_SIZE, num_layers=NUM_LAYERS, bidirectional=BIDIRECTIONAL)
        self.l1 = Linear(HIDDEN_SIZE, 128)
        self.l2 = Linear(128, 16)
        self.l3 = Linear(16, 1)

    def forward(self, x):
        embedding = self.emb(x)
        out, states = self.lstm1(embedding)
        out, states = self.lstm2(out)   # <-- the output of this lstm layer is nan
        out, states = self.lstm3(out)
        final_hidden_state = out[:, -1, :]
        x = F.relu(self.l1(final_hidden_state))
        x = F.relu(self.l2(x))
        out = F.sigmoid(self.l3(x))
        return out
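
The model itself is created with something like:

model = LSTMSpamClassifier(HIDDEN_SIZE, NUM_LAYERS).to(device)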

With my training loop being:

model.train()

for epoch in range(EPOCHS):
    
    for x, y in train_dataloader:
        
        y_pred = model(x)

        y = y.float()
        y_pred = y_pred.squeeze(-1)

        loss = loss_fn(y, y_pred)
        
        optim.zero_grad()

        loss.backward()
        
        optim.step()
        
        
    print(f'EPOCH: {epoch}/{EPOCHS} | Loss: {loss}')

When I try to train this model, the first batch behaves as I would expect, but on the second batch the second LSTM layer outputs a torch.Tensor full of NaNs.

Example output:

tensor(51.2606, grad_fn=<BinaryCrossEntropyBackward0>)
EPOCH: 0/30 | Loss: 51.260623931884766
tensor(nan, grad_fn=<BinaryCrossEntropyBackward0>)
EPOCH: 1/30 | Loss: nan

Does anyone have an explanation of this behaviour?

Just some comments and questions:

  • Why do you have lstm1, lstm2, and lstm3, with each of these modules having 16 layers? This means you have an LSTM stack with 3*16 = 48 layers. A single LSTM with maybe 1-3 layers should be more than enough.

  • I assume the shape of embedding is (batch_size, seq_len, embedding_size). However, you define the LSTM layers with batch_first=False (that’s the default), which expects input of shape (seq_len, batch_size, embedding_size). So you probably need to do embedding = embedding.transpose(1, 0) first.

  • With batch_first=False, the shape of out is (seq_len, batch_size, num_directions * hidden_size). So out[:, -1, :] is wrong; it should simply be out[-1] (see the sketch below).
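
Something along these lines should work (an untested sketch that keeps your embedding and linear head, but uses a single LSTM with two stacked layers and batch_first=True; same imports and globals as in your code):

class LSTMSpamClassifier(nn.Module):

    def __init__(self):
        super().__init__()
        self.emb = Embedding.from_pretrained(torch.Tensor(w2v.wv.vectors))
        # one LSTM with a couple of stacked layers instead of 3 x 16
        self.lstm = LSTM(input_size=EMBEDDING_SIZE, hidden_size=HIDDEN_SIZE,
                         num_layers=2, batch_first=True)
        self.l1 = Linear(HIDDEN_SIZE, 128)
        self.l2 = Linear(128, 16)
        self.l3 = Linear(16, 1)

    def forward(self, x):                    # x: (batch_size, seq_len)
        embedding = self.emb(x)              # (batch_size, seq_len, EMBEDDING_SIZE)
        out, _ = self.lstm(embedding)        # (batch_size, seq_len, HIDDEN_SIZE)
        final_hidden_state = out[:, -1, :]   # last time step: (batch_size, HIDDEN_SIZE)
        x = F.relu(self.l1(final_hidden_state))
        x = F.relu(self.l2(x))
        return torch.sigmoid(self.l3(x))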

  • I had (until now) so many layers in the LSTM because I thought I needed that many.
  • Instead of transposing the embeddings and re-selecting the output of the final LSTM, I added batch_first=True to all of the LSTM layers. Still, the output is:

tensor(51.3910, grad_fn=<BinaryCrossEntropyBackward0>)
EPOCH: 0/30 | Loss: 51.39103698730469
tensor(nan, grad_fn=<BinaryCrossEntropyBackward0>)
EPOCH: 1/30 | Loss: nan