My validation loss decreases then increases

Hi all,

I am new to NLP. I was practicing on the Yelp Review dataset and tried to build a simple LSTM network. The problem is that my validation loss decreases to a certain point and then suddenly starts increasing. I've applied text preprocessing and dropout, but still no luck. What should I do to prevent overfitting?

I am attaching a terminal screenshot for reference.

My network architecture is as follows:

import torch
import torch.nn as nn

class RNN(nn.Module):

    def __init__(self, n_vocab, embed_dim, n_hidden, n_rnnlayers, n_outputs):
        super(RNN, self).__init__()

        self.V = n_vocab      # vocabulary size
        self.D = embed_dim    # embedding dimension
        self.M = n_hidden     # LSTM hidden size
        self.K = n_outputs    # number of output classes
        self.L = n_rnnlayers  # number of stacked LSTM layers

        self.embed = nn.Embedding(self.V, self.D)
        self.rnn = nn.LSTM(
            input_size=self.D,
            hidden_size=self.M,
            num_layers=self.L,
            batch_first=True)

        self.dropout = nn.Dropout(p=0.2)

        self.fc = nn.Linear(self.M, self.K)

    def forward(self, X):
        # initial hidden and cell states (device is defined elsewhere in the script)
        h0 = torch.zeros(self.L, X.size(0), self.M).to(device)
        c0 = torch.zeros(self.L, X.size(0), self.M).to(device)

        # embedding layer: turns word indices into word vectors, (N, T) -> (N, T, D)
        out = self.embed(X)

        # LSTM outputs for every time step, (N, T, M)
        out, _ = self.rnn(out, (h0, c0))

        # global max pool over the time dimension, (N, T, M) -> (N, M)
        out, _ = torch.max(out, 1)

        out = self.dropout(out)

        out = self.fc(out)

        return out

That the validation loss goes up again at some point is pretty normal and is one of the main indicators that your model is starting to overfit. So typically you keep track of the model state that yields the lowest validation loss and perform the testing on that best state.
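For example, a minimal sketch of that checkpointing idea (train_one_epoch, evaluate, train_loader, val_loader and n_epochs are placeholders for your own training loop):

import copy

best_val_loss = float("inf")
best_state = None

for epoch in range(n_epochs):
    train_one_epoch(model, train_loader)    # your usual training step
    val_loss = evaluate(model, val_loader)  # mean validation loss for this epoch

    if val_loss < best_val_loss:            # new best model so far
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())

# restore the best state before running on the test set
model.load_state_dict(best_state)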

I don’t think you can truly avoid overfitting, at least not without a virtually unlimited amount of training data. In most practical cases, there will be differences between the distributions of the training data and the validation/test data.

However, even your training loss seems to only go down to about half before converging. The only thing I'm not sure about is the max-pool part in your forward() method. What is the semantic intuition here?

Can you try it without the pooling and simply give the last hidden state to the Dropout and Linear layer? Just curious how the losses will develop using this more standard approach. Since you don't use a bidirectional LSTM, it should look like this:

out, (h, c) = self.rnn(out, (h0, c0))   # h: (L, N, M)
out = self.dropout(h[-1])               # last layer's final hidden state, (N, M)
out = self.fc(out)
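For context, the full forward() would then look something like this (same attribute names as in your class, with the pooling line dropped):

def forward(self, X):
    # initial hidden and cell states
    h0 = torch.zeros(self.L, X.size(0), self.M).to(device)
    c0 = torch.zeros(self.L, X.size(0), self.M).to(device)

    out = self.embed(X)                    # (N, T) -> (N, T, D)
    out, (h, c) = self.rnn(out, (h0, c0))  # h: (L, N, M)

    out = self.dropout(h[-1])              # last layer's final hidden state, (N, M)
    out = self.fc(out)                     # (N, K)
    return out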

Hi Chris,

Thanks for the reply. I used the global max pool because it reduces the dimension from N x T x M to N x M. I tried the solution you proposed, but it seems like the same thing is happening as before.
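For reference, a quick shape check of that pooling step on a dummy tensor (the sizes here are just examples, not my actual batch):

import torch

out = torch.randn(32, 100, 128)    # (N, T, M): batch, time steps, hidden size
pooled, _ = torch.max(out, dim=1)  # max over the time dimension
print(pooled.shape)                # torch.Size([32, 128]), i.e. (N, M)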