Hello!
Could you please tell me how to calculate the loss function for next-word prediction?
Here are all the steps:
For example, I have N sentences and mini_batch_size = 2.

I get minibatch of sentences, for example:
[ 6, 7, 8 ]
[ 1, 2, 3, 4, 5 ] 
Sort the mini-batch by length:
[ 1, 2, 3, 4, 5 ]
[ 6, 7, 8 ] 
Split each sentence into X and Y, and get a list of lengths:
X = sentence[ : -1 ]
Y = sentence[ 1 : ]
X = [[ 1, 2, 3, 4 ], [ 6, 7 ]]
Y = [[ 2, 3, 4, 5 ], [ 7, 8 ]]
X_lengths = [4, 2]

Pad sentences with pad_index (0 in my case):
X = [[ 1, 2, 3, 4 ], [ 6, 7, 0, 0 ]]
Y = [[ 2, 3, 4, 5 ], [ 7, 8, 0, 0 ]]
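To make the sort / split / pad steps concrete, here is a minimal plain-Python sketch of what I do (assuming pad_index = 0, as above):

```python
# Sketch of the sort / split / pad steps, with pad_index = 0.
sentences = [[6, 7, 8], [1, 2, 3, 4, 5]]

# Sort by length, longest first (pack_padded_sequence expects this order).
batch = sorted(sentences, key=len, reverse=True)

# Shift by one token: X is everything but the last token,
# Y is everything but the first.
X = [s[:-1] for s in batch]
Y = [s[1:] for s in batch]
X_lengths = [len(x) for x in X]

# Pad every sequence to the longest length in the batch with pad_index = 0.
max_len = max(X_lengths)
X = [x + [0] * (max_len - len(x)) for x in X]
Y = [y + [0] * (max_len - len(y)) for y in Y]
```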

model.zero_grad()

Get embeddings:
word_embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
X = word_embeddings(X)
X.size() = torch.Size([mini_batch_size, seq_length, embedding_dim])

Transpose:
X = torch.transpose(X, 0, 1)
X.size() = torch.Size([seq_length, mini_batch_size, embedding_dim])

Pack by:
X = torch.nn.utils.rnn.pack_padded_sequence(X, X_lengths)
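The embed / transpose / pack steps above can be sketched as follows (toy sizes are my own choice for illustration: vocab_size = 10, embedding_dim = 5):

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10, 5
X = torch.tensor([[1, 2, 3, 4], [6, 7, 0, 0]])  # padded token indices
X_lengths = [4, 2]

# padding_idx=0 keeps the pad embedding fixed at zero.
word_embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

emb = word_embeddings(X)          # (mini_batch_size, seq_length, embedding_dim)
emb = torch.transpose(emb, 0, 1)  # (seq_length, mini_batch_size, embedding_dim)

# pack_padded_sequence drops the padded positions: packed.data
# holds only the 4 + 2 = 6 real time steps.
packed = nn.utils.rnn.pack_padded_sequence(emb, X_lengths)
```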

Initialise hidden and cell states (zero tensors, not torch.Size objects):
h_t = torch.zeros(layers_dim, mini_batch_size, hidden_dim)
h_c = torch.zeros(layers_dim, mini_batch_size, hidden_dim)
LSTM:
lstm_out, (h_t, h_c) = lstm(X, (h_t, h_c))

Pad by:
lstm_out, _ = torch.nn.utils.rnn.pad_packed_sequence(lstm_out)
lstm_out.size() = torch.Size([seq_length, mini_batch_size, hidden_dim])
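The LSTM / unpack step looks like this in a runnable sketch (toy sizes embedding_dim = 5, hidden_dim = 8, layers_dim = 1 are my own assumptions):

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim, layers_dim = 5, 8, 1
mini_batch_size, seq_length = 2, 4
X_lengths = [4, 2]

lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=layers_dim)

# Stand-in for the packed embeddings from the previous step.
packed = nn.utils.rnn.pack_padded_sequence(
    torch.randn(seq_length, mini_batch_size, embedding_dim), X_lengths)

# Initial hidden and cell states are zero tensors.
h_t = torch.zeros(layers_dim, mini_batch_size, hidden_dim)
h_c = torch.zeros(layers_dim, mini_batch_size, hidden_dim)

lstm_out, (h_t, h_c) = lstm(packed, (h_t, h_c))

# Re-pad so lstm_out is (seq_length, mini_batch_size, hidden_dim) again.
lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out)
```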

Linear layer:
fc = nn.Linear(hidden_dim, vocab_size)
linear_out = fc(lstm_out)
linear_out.size() = torch.Size([seq_length, mini_batch_size, vocab_size])

Softmax (over the vocabulary axis, which is dim=2 for this shape):
Y_hat = F.log_softmax(linear_out, dim=2)
Y_hat.size() = torch.Size([seq_length, mini_batch_size, vocab_size])
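A quick sanity check that dim=2 is the vocabulary axis for this layout (the probabilities at each position should sum to 1):

```python
import torch
import torch.nn.functional as F

seq_length, mini_batch_size, vocab_size = 4, 2, 10
linear_out = torch.randn(seq_length, mini_batch_size, vocab_size)

# log_softmax must run over the vocab axis, i.e. dim=2 for
# a (seq_length, mini_batch_size, vocab_size) tensor.
Y_hat = F.log_softmax(linear_out, dim=2)

# Exponentiating and summing over dim=2 should give ~1 everywhere.
probs = Y_hat.exp().sum(dim=2)
```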
So, the question is:
How do I get the loss, if:
Y_hat.size() = torch.Size([seq_length, mini_batch_size, vocab_size])
Y.size() = torch.Size([mini_batch_size, seq_length])
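Here is the approach I was considering (only a sketch, assuming NLLLoss with ignore_index=0 is the right way to mask the padding): transpose Y to match Y_hat, flatten both, and apply NLLLoss.

```python
import torch
import torch.nn as nn

seq_length, mini_batch_size, vocab_size = 4, 2, 10

# Stand-ins for the tensors from the steps above.
Y_hat = torch.log_softmax(
    torch.randn(seq_length, mini_batch_size, vocab_size), dim=2)
Y = torch.tensor([[2, 3, 4, 5], [7, 8, 0, 0]])  # (mini_batch_size, seq_length)

# Bring Y to (seq_length, mini_batch_size) so it lines up with Y_hat,
# then flatten both to 2-D / 1-D for NLLLoss.
Y_flat = Y.transpose(0, 1).reshape(-1)          # (seq_length * mini_batch_size,)
Y_hat_flat = Y_hat.reshape(-1, vocab_size)      # (seq_length * mini_batch_size, vocab_size)

# ignore_index=0 should make the padded positions contribute nothing.
loss_fn = nn.NLLLoss(ignore_index=0)
loss = loss_fn(Y_hat_flat, Y_flat)
```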
And am I doing the previous steps right?
Thanks in advance!