# What can be the reason my test loss is so low?

Hi,

I’m trying to train a language model using a BiLSTM, but I’m getting really weird values for the test loss.

A training epoch looks like this:

``````
for batch in tqdm(get_batches(train, BATCH_SIZE, shuffle=True)):
    X, y, lengths = batch
    batch_size, seq_len = X.size()
    hidden = model.init_hidden(batch_size)

    optimizer.zero_grad()  # clear gradients left over from the previous batch
    yhat, hidden = model(X, lengths, hidden)
    yhat = yhat.permute(1, 2, 0)  # move the vocabulary dimension to position 1 for CrossEntropyLoss
    loss = loss_function(yhat, y)
    loss.backward()
    optimizer.step()
    total_train_loss += loss.item()
    hidden = (hidden[0].detach(), hidden[1].detach())
``````

and my test loop looks like this:

``````
model.eval()
with torch.no_grad():  # gradients are not needed during evaluation
    for batch in tqdm(get_batches(test, 1, shuffle=False)):
        X, y, lengths = batch
        batch_size, seq_len = X.size()
        hidden = model.init_hidden(batch_size)

        yhat, hidden = model(X, lengths, hidden)
        yhat = yhat.permute(1, 2, 0)  # move the vocabulary dimension to position 1 for CrossEntropyLoss
        loss = loss_function(yhat, y)

        total_test_loss += loss.item()
        hidden = (hidden[0].detach(), hidden[1].detach())
``````

I’m getting a loss of 1.43, and that’s without dividing it by the number of batches.

Does anyone have any idea why that would happen? I’m using `nn.CrossEntropyLoss(ignore_index=PAD)` as the loss function.

Thanks!
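One way to make the reported number comparable across runs is to average over tokens rather than over batches. The sketch below is only an illustration: it assumes the padding index is 0, uses a toy vocabulary size, and replaces real model outputs with random tensors (the hypothetical `fake_batches`). It sums the loss per batch with `reduction='sum'` and divides once by the number of non-padding targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD = 0          # assumed padding index
VOCAB_SIZE = 7   # toy vocabulary size, for illustration only

# Summing per batch lets us normalise once, over all tokens, at the end.
sum_loss_function = nn.CrossEntropyLoss(ignore_index=PAD, reduction='sum')

# Hypothetical stand-in for (model output, target) pairs with batch size 1.
fake_batches = []
for seq_len in (4, 6, 3):
    logits = torch.randn(1, VOCAB_SIZE, seq_len)    # (batch, vocab, seq_len)
    y = torch.randint(1, VOCAB_SIZE, (1, seq_len))  # (batch, seq_len), no PAD tokens
    fake_batches.append((logits, y))

total_loss, total_tokens = 0.0, 0
with torch.no_grad():
    for logits, y in fake_batches:
        total_loss += sum_loss_function(logits, y).item()
        total_tokens += (y != PAD).sum().item()

avg_loss_per_token = total_loss / total_tokens
print(f"average loss per token: {avg_loss_per_token:.4f}")
```

Averaging per token rather than per batch keeps long and short sequences weighted consistently, which matters when the test batch size is 1.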

I’ve seen in several tutorials that I need to reshape my `yhat` with `yhat.view(-1, VOCAB_SIZE)`. Can anyone explain why I would (or would not) need to do that?

`nn.CrossEntropyLoss` expects the logits as a 2D tensor of shape `(N, VOCAB_SIZE)` and the targets `y` as a 1D tensor of `N` class indices. So, when you pass `yhat` and `y` to `loss_function`, you want to make sure their shapes line up that way. Using `yhat.view(-1, VOCAB_SIZE)` (together with `y.view(-1)`) will do the job.

It also runs without it, so I was wondering if that’s important… My loss seems wrong on both instances…
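Both call styles can be compared directly. The sketch below is an illustration on random tensors, assuming a batch-first `(batch, seq_len, VOCAB_SIZE)` output and `PAD = 0`; the sizes are made up. `nn.CrossEntropyLoss` accepts either flattened `(N, C)` logits with `(N,)` targets, or `(N, C, L)` logits with `(N, L)` targets, and with the default reduction the two should give the same value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD = 0                                    # assumed padding index
VOCAB_SIZE, batch_size, seq_len = 7, 3, 5  # toy sizes, for illustration only

loss_function = nn.CrossEntropyLoss(ignore_index=PAD)

# Random stand-ins for model output and targets (batch-first layout assumed).
logits = torch.randn(batch_size, seq_len, VOCAB_SIZE)
y = torch.randint(1, VOCAB_SIZE, (batch_size, seq_len))
y[0, -2:] = PAD  # pretend the first sequence is two tokens shorter

# Style 1: flatten batch and time into one dimension of size N.
flat_loss = loss_function(logits.view(-1, VOCAB_SIZE), y.view(-1))

# Style 2: keep the time dimension but move the classes to dimension 1.
permuted_loss = loss_function(logits.permute(0, 2, 1), y)

print(flat_loss.item(), permuted_loss.item())
```

So the `view` is not required as long as the class dimension ends up in position 1; what matters is that logits and targets are laid out consistently.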

Can you show the way you defined `loss_function`?

`loss_function = nn.CrossEntropyLoss(ignore_index=PAD)` where `PAD` is zero.

`CrossEntropyLoss` has a parameter called `reduction`, whose default value averages the loss (`'elementwise_mean'` in older PyTorch releases, called `'mean'` in current ones). That means it automatically divides the summed loss by the number of target elements that are not skipped via `ignore_index`, not by `VOCAB_SIZE`.

If you do not want that automatic averaging and would rather compute the mean of the loss with respect to each batch, then you should do the following:

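``````
## define the loss
# The original line is missing from the post; reduction='sum' is presumably what was meant.
loss_function = nn.CrossEntropyLoss(ignore_index=PAD, reduction='sum')
``````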

and in your training call it as follows

``````
loss = loss_function(yhat, y) / len(y)
``````
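A quick way to see what each setting divides by is to compare the reductions on random data. This sketch assumes `PAD = 0` and toy sizes; it suggests that the default reduction averages over the non-padding targets, while `reduction='sum'` divided by `len(y)` gives a per-sequence loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD = 0                                    # assumed padding index
VOCAB_SIZE, batch_size, seq_len = 7, 3, 5  # toy sizes, for illustration only

# Random stand-ins, already in the (batch, vocab, seq_len) layout.
yhat = torch.randn(batch_size, VOCAB_SIZE, seq_len)
y = torch.randint(1, VOCAB_SIZE, (batch_size, seq_len))
y[1, 3:] = PAD  # two padded positions in the second sequence

mean_loss = nn.CrossEntropyLoss(ignore_index=PAD)(yhat, y)
sum_loss = nn.CrossEntropyLoss(ignore_index=PAD, reduction='sum')(yhat, y)

# The default reduction divides by the count of non-ignored targets.
n_tokens = (y != PAD).sum().item()
print(torch.allclose(mean_loss, sum_loss / n_tokens))

# Dividing the summed loss by len(y) normalises by batch size instead.
per_sequence_loss = sum_loss / len(y)
print(mean_loss.item(), per_sequence_loss.item())
```

The per-sequence value is larger than the default mean by a factor of roughly the average sequence length, which is worth keeping in mind when comparing losses before and after the change.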

I’ll try that, thanks!

No luck… Now the loss is huge, and the validation loss is rising instead of falling… Maybe I’m missing something.
How would you train this model?

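``````
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BiLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers, dropout_prob=0.5):
        super(BiLSTM, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.dropout = nn.Dropout(p=dropout_prob)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers, batch_first=True, dropout=dropout_prob, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, vocab_size)

    def forward(self, input, lengths, hidden):
        embed = self.dropout(self.embedding(input))
        # The pack/unpack lines are missing from the post; they are reconstructed here
        # on the assumption that the standard rnn utilities were used.
        packed = pack_padded_sequence(embed, lengths, batch_first=True)
        packed_out, hidden = self.lstm(packed, hidden)
        out, _ = pad_packed_sequence(packed_out)  # assumed layout: (seq_len, batch, 2 * hidden_size)
        out = self.dropout(out)
        out = self.fc(out)
        return out, hidden

    def init_hidden(self, batch_size):
        # `device` is a global defined elsewhere in the script.
        return (torch.zeros(2 * self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(2 * self.num_layers, batch_size, self.hidden_size).to(device))
``````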

Thanks again!

Now the loss is summed over the tokens and divided only by `len(y)`, rather than averaged by the default reduction, so it is expected to be higher. If training itself is unstable, you might need to reduce the `learning_rate`: because the loss is scaled up, the gradients are scaled up as well, so the update steps are larger. Lowering the `learning_rate` compensates for that.
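The effect of the normalisation on the gradients can be checked on a toy example. The sketch below assumes `PAD = 0` and random logits in place of a real model; it compares the gradient of the default mean loss with the gradient of the summed loss divided by `len(y)`. The helper `logits_grad` is a hypothetical name introduced only for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD = 0                                    # assumed padding index
VOCAB_SIZE, batch_size, seq_len = 7, 3, 5  # toy sizes, for illustration only

y = torch.randint(1, VOCAB_SIZE, (batch_size, seq_len))
y[2, 2:] = PAD                             # three padded positions
n_tokens = (y != PAD).sum().item()         # 12 real tokens
base = torch.randn(batch_size, VOCAB_SIZE, seq_len)

def logits_grad(loss_fn, divisor):
    """Gradient of loss_fn(logits, y) / divisor with respect to the logits."""
    logits = base.clone().requires_grad_(True)
    loss = loss_fn(logits, y) / divisor
    loss.backward()
    return logits.grad

grad_mean = logits_grad(nn.CrossEntropyLoss(ignore_index=PAD), 1)
grad_per_seq = logits_grad(nn.CrossEntropyLoss(ignore_index=PAD, reduction='sum'), len(y))

# The per-sequence loss is larger by the average number of real tokens per sequence,
# and its gradient is larger by the same factor.
scale = n_tokens / len(y)
print(scale, torch.allclose(grad_per_seq, grad_mean * scale, atol=1e-6))
```

In this toy case the factor is the average number of non-padding tokens per sequence, so longer sequences mean a proportionally larger gradient for the same `learning_rate`.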

I thought that if I used Adam, tuning the learning rate would be less of a problem. Would you recommend another optimizer?

Well, it’s true that Adam is able to adjust the learning rate based on the estimated moments. However, if the initial learning rate is too high, it can diverge quickly.

By monitoring the training losses you can understand if it is diverging due to high learning rate or not.
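A simple way to do that monitoring is to record the average training loss per epoch and compare early values with recent ones. The helper below, `looks_divergent`, is a hypothetical heuristic written for this illustration, not part of PyTorch, and the loss values are made up.

```python
import math

def looks_divergent(losses, window=3):
    """Rough heuristic: flag a run whose loss became non-finite, or whose
    recent average is higher than its average at the start of training."""
    if any(not math.isfinite(value) for value in losses):
        return True
    if len(losses) < 2 * window:
        return False  # not enough history to compare
    first = sum(losses[:window]) / window
    last = sum(losses[-window:]) / window
    return last > first

# Made-up per-epoch training losses for two runs.
healthy_run = [5.2, 4.1, 3.6, 3.3, 3.1, 3.0]
unstable_run = [5.2, 6.8, 9.5, 14.0, 31.0, float('inf')]

print(looks_divergent(healthy_run))   # False
print(looks_divergent(unstable_run))  # True
```

If a run is flagged, rerunning with a smaller `learning_rate` and checking whether the curve settles is a cheap way to confirm that the step size was the cause.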