What can be the reason my test loss is so low?

Hi,

I’m trying to train a language model using a BiLSTM, but I’m getting really weird values for the test loss.

A training epoch looks like this:

for batch in tqdm(get_batches(train, BATCH_SIZE, shuffle=True)):
    model.zero_grad()

    X, y, lengths = batch
    batch_size, seq_len = X.size()
    hidden = model.init_hidden(batch_size)

    yhat, hidden = model(X, lengths, hidden)
    yhat = yhat.permute(1, 2, 0)   # (seq_len, batch, vocab) -> (batch, vocab, seq_len) for CrossEntropyLoss
    loss = loss_function(yhat, y)
    loss.backward()
    optimizer.step()
    total_train_loss += loss.item()
    hidden = (hidden[0].detach(), hidden[1].detach())

and my test loop looks like this:

model.eval()
with torch.no_grad():
    for batch in tqdm(get_batches(test, 1, shuffle=False)):
        X, y, lengths = batch
        batch_size, seq_len = X.size()
        hidden = model.init_hidden(batch_size)
        
        yhat, hidden = model(X, lengths, hidden)
        yhat = yhat.permute(1, 2, 0)
        loss = loss_function(yhat, y)

        total_test_loss += loss.item()
        hidden = (hidden[0].detach(), hidden[1].detach())

I’m getting a loss of 1.43, and that’s without dividing it by the number of batches.

Does anyone have an idea why that would happen? I’m using nn.CrossEntropyLoss(ignore_index=PAD) as the loss function.

Thanks!

I’ve seen in several tutorials that I need to view my yhat as yhat.view(-1, VOCAB_SIZE). Can anyone explain why I would (or wouldn’t) need to do that?

In its basic form, nn.CrossEntropyLoss expects logits of shape (N, C) and targets of shape (N,) containing class indices. Your yhat comes out with an extra sequence dimension, so when you pass yhat and y to loss_function you have to reshape them to match what the loss expects. Using yhat.view(-1, VOCAB_SIZE) together with y.view(-1) will do the job, as long as both are flattened in the same batch/time order.
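
For example, with toy shapes (the sizes and values below are made up just to illustrate, not your actual dimensions), both calling conventions give the same value:

import torch
import torch.nn as nn

BATCH, SEQ_LEN, VOCAB_SIZE, PAD = 2, 5, 10, 0
loss_function = nn.CrossEntropyLoss(ignore_index=PAD)

yhat = torch.randn(SEQ_LEN, BATCH, VOCAB_SIZE)        # (seq_len, batch, vocab), like your model output
y = torch.randint(1, VOCAB_SIZE, (BATCH, SEQ_LEN))    # (batch, seq_len) target indices, no padding here

# Option 1: move the class dimension second -> (batch, vocab, seq_len) vs (batch, seq_len)
loss_a = loss_function(yhat.permute(1, 2, 0), y)

# Option 2: flatten to (batch * seq_len, vocab) vs (batch * seq_len,), keeping the same token order
loss_b = loss_function(yhat.permute(1, 0, 2).reshape(-1, VOCAB_SIZE), y.reshape(-1))

print(torch.allclose(loss_a, loss_b))   # True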

It also runs without it, so I was wondering if it’s important… My loss seems wrong in both cases…

Can you show the way you defined loss_function?

loss_function = nn.CrossEntropyLoss(ignore_index=PAD) where PAD is zero.

CrossEntropyLoss has a parameter called reduction whose default value is reduction='mean'. That means it automatically divides the summed loss by the number of target tokens that are not ignored, which here is roughly batch_size * seq_len.

If you do not want to average over every token, and instead want the mean of the summed sequence loss per batch, then you should do the following:

## define the loss
loss_function = nn.CrossEntropyLoss(ignore_index=PAD, reduction='sum')

and in your training loop call it as follows:

loss = loss_function(yhat, y) / len(y)   # len(y) == batch_size, so this is the summed token loss averaged per sequence
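
To see the scaling difference concretely, here is a toy comparison (shapes and values are invented, not your data):

import torch
import torch.nn as nn

PAD, VOCAB_SIZE = 0, 10
yhat = torch.randn(2, VOCAB_SIZE, 5)             # (batch, vocab, seq_len)
y = torch.randint(1, VOCAB_SIZE, (2, 5))         # (batch, seq_len), no pad tokens here

mean_loss = nn.CrossEntropyLoss(ignore_index=PAD)(yhat, y)                   # averaged per token
sum_loss = nn.CrossEntropyLoss(ignore_index=PAD, reduction='sum')(yhat, y)   # summed over tokens

print(torch.allclose(mean_loss, sum_loss / y.numel()))   # True: mean = sum / number of (non-pad) tokens
print(sum_loss / len(y))                                 # what the snippet above computes: sum / batch_size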

I’ll try that, thanks!

No luck… Now the loss is huge, and the validation loss is rising instead of falling… Maybe I’m missing something.
How would you train this model?

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # defined once at the top of the script

class BiLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers, dropout_prob=0.5):
        super(BiLSTM, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.dropout = nn.Dropout(p=dropout_prob)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers, batch_first=True, dropout=dropout_prob, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, vocab_size)
    
    def forward(self, input, lengths, hidden):
        embed = self.dropout(self.embedding(input))
        # pack_padded_sequence expects lengths sorted in descending order
        # (or enforce_sorted=False on newer PyTorch versions)
        packed = pack_padded_sequence(embed, lengths, batch_first=True)
        packed_out, hidden = self.lstm(packed, hidden)
        out, _ = pad_packed_sequence(packed_out)   # (seq_len, batch, 2 * hidden_size); batch_first defaults to False
        out = self.dropout(out)
        out = self.fc(out)
        return out, hidden
    
    def init_hidden(self, batch_size):
        return (torch.zeros(2 * self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(2 * self.num_layers, batch_size, self.hidden_size).to(device))
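
For reference, this is roughly how I build it and sanity-check the output shape (the hyperparameter values here are placeholders, not the ones I actually train with):

VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_SIZE, NUM_LAYERS = 10000, 128, 256, 2
BATCH_SIZE, SEQ_LEN = 4, 12

model = BiLSTM(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_SIZE, NUM_LAYERS).to(device)

X = torch.randint(1, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN)).to(device)
lengths = [SEQ_LEN] * BATCH_SIZE                 # all sequences at full length in this dummy check
hidden = model.init_hidden(BATCH_SIZE)

yhat, hidden = model(X, lengths, hidden)
print(yhat.shape)   # (SEQ_LEN, BATCH_SIZE, VOCAB_SIZE), since pad_packed_sequence defaults to batch_first=False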

Thanks again!

Now the loss is summed over all the tokens in each sequence instead of averaged, so it is expected to be higher. If training itself is misbehaving, you might need to reduce the learning rate: since the loss is scaled up, the gradients are scaled up as well, so the update steps become larger. Lowering the learning rate compensates for that.
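
For example (the exact value is just a starting point to experiment with, not a recommendation for your data):

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # an order of magnitude below Adam's default lr of 1e-3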

I thought that if I used Adam, playing with the learning rate would be less of a problem. Do you recommend another optimizer?

Well, it’s true that Adam adjusts the learning rate based on its estimated moments. However, if the initial learning rate is too high, it can still diverge quickly.

By monitoring the training loss you can tell whether it is diverging because the learning rate is too high.
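
A minimal way to do that, reusing your total_train_loss and assuming a num_train_batches counter that is not in your snippet (placeholder name):

avg_train_loss = total_train_loss / num_train_batches
print(f'epoch {epoch}: average training loss = {avg_train_loss:.4f}')   # if this keeps growing, lower the learning rate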