I have an email dataset that I need to classify as spam or not spam. I am using gensim.models.Word2Vec
to create word embeddings for the train and test emails. To preprocess the dataset, I use the following tokens
MAX_SENTENCE_TOKENS = 100
EMBEDDING_SIZE = 300
START_TOKEN = '<START>'
END_TOKEN = '<END>'
PADDING_TOKEN = '<PADDING>'
to build sentences that are at most 100 tokens long. After this, I train a skip-gram Word2Vec model and replace each word with its embedding index.
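Concretely, the preprocessing looks roughly like this (train_texts and the whitespace tokenizer here are simplified stand-ins for my actual pipeline):

from gensim.models import Word2Vec

def to_tokens(text):
    # truncate so that START/END still fit into MAX_SENTENCE_TOKENS
    words = text.lower().split()[:MAX_SENTENCE_TOKENS - 2]
    tokens = [START_TOKEN] + words + [END_TOKEN]
    # pad every sentence to a fixed length
    tokens += [PADDING_TOKEN] * (MAX_SENTENCE_TOKENS - len(tokens))
    return tokens

sentences = [to_tokens(t) for t in train_texts]

# sg=1 selects the skip-gram architecture; min_count=1 so every
# token (including the padding token) gets a vector
w2v = Word2Vec(sentences, vector_size=EMBEDDING_SIZE, sg=1, min_count=1)

# exchange every word for its index into w2v.wv.vectors
x_train = [[w2v.wv.key_to_index[tok] for tok in sent] for sent in sentences]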
For the dataset I have the following torch.utils.data.Dataset class defined:
import torch
from torch.utils.data import Dataset

class Data(Dataset):
    def __init__(self, x, y):
        # x: matrix of embedding indices, y: binary spam labels
        self.X = torch.Tensor(x, device=device).long()
        self.y = torch.Tensor(y, device=device).long()
        self.len = self.X.shape[0]

    def __getitem__(self, index):
        return self.X[index], self.y[index]

    def __len__(self):
        return self.len
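I then wrap the training split in a DataLoader (x_train and y_train being the index matrix and the binary labels from the preprocessing sketch above, and BATCH_SIZE as defined below):

from torch.utils.data import DataLoader

train_data = Data(x_train, y_train)
train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)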
Moreover, my hyperparameters and model are defined as:
import torch.nn as nn
from torch.optim import Adam

BATCH_SIZE = 8
EPOCHS = 30
HIDDEN_SIZE = 64
NUM_LAYERS = 16
LEARNING_RATE = .0001
BIDIRECTIONAL = False

loss_fn = nn.BCELoss()
optim = Adam(model.parameters(), lr=LEARNING_RATE)
import torch.nn.functional as F
from torch.nn import Embedding, LSTM, Linear

class LSTMSpamClassifier(nn.Module):
    def __init__(self, hidden_dim, num_layers):
        super(LSTMSpamClassifier, self).__init__()
        # pretrained Word2Vec vectors (frozen by default with from_pretrained)
        self.emb = Embedding.from_pretrained(torch.Tensor(w2v.wv.vectors))
        self.lstm1 = LSTM(input_size=EMBEDDING_SIZE, hidden_size=HIDDEN_SIZE, num_layers=NUM_LAYERS, bidirectional=BIDIRECTIONAL)
        self.lstm2 = LSTM(input_size=HIDDEN_SIZE, hidden_size=HIDDEN_SIZE, num_layers=NUM_LAYERS, bidirectional=BIDIRECTIONAL)
        self.lstm3 = LSTM(input_size=HIDDEN_SIZE, hidden_size=HIDDEN_SIZE, num_layers=NUM_LAYERS, bidirectional=BIDIRECTIONAL)
        self.l1 = Linear(HIDDEN_SIZE, 128)
        self.l2 = Linear(128, 16)
        self.l3 = Linear(16, 1)

    def forward(self, x):
        embedding = self.emb(x)
        out, states = self.lstm1(embedding)
        out, states = self.lstm2(out)  # <-- the output of this lstm layer is nan
        out, states = self.lstm3(out)
        final_hidden_state = out[:, -1, :]
        x = F.relu(self.l1(final_hidden_state))
        x = F.relu(self.l2(x))
        out = F.sigmoid(self.l3(x))
        return out
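The model is instantiated roughly like this (note that the hidden_dim and num_layers arguments are currently unused, since the layers read the module-level constants):

# __init__'s arguments are accepted but ignored; the layers use the globals
model = LSTMSpamClassifier(HIDDEN_SIZE, NUM_LAYERS).to(device)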
My training loop is:
model.train()
for epoch in range(EPOCHS):
    for x, y in train_dataloader:
        y_pred = model(x)
        y = y.float()
        y_pred = y_pred.squeeze(-1)
        loss = loss_fn(y, y_pred)
        optim.zero_grad()
        loss.backward()
        optim.step()
    print(f'EPOCH: {epoch}/{EPOCHS} | Loss: {loss}')
When I train this model, the first batch behaves as I would expect, but from the second batch onward the second LSTM layer outputs a torch.Tensor
of NaNs.
Example output:
tensor(51.2606, grad_fn=<BinaryCrossEntropyBackward0>)
EPOCH: 0/30 | Loss: 51.260623931884766
tensor(nan, grad_fn=<BinaryCrossEntropyBackward0>)
EPOCH: 1/30 | Loss: nan
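To confirm that the NaNs first show up in the output of the second LSTM layer (and not just in the loss), I attached forward hooks roughly like this:

def nan_hook(name):
    def hook(module, inputs, output):
        # LSTM layers return (output, (h_n, c_n)); take the output tensor
        out = output[0] if isinstance(output, tuple) else output
        if torch.isnan(out).any():
            print(f'NaNs detected in the output of {name}')
    return hook

for name, module in model.named_children():
    module.register_forward_hook(nan_hook(name))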
Does anyone have an explanation of this behaviour?