Hello!
Could you please tell me how to calculate the loss function for next-word prediction?
Here are all the steps:
For example, I have N sentences and mini_batch_size = 2.

I get minibatch of sentences, for example:
[ 6, 7, 8 ]
[ 1, 2, 3, 4, 5 ] 
Sort the mini-batch by length:
[ 1, 2, 3, 4, 5 ]
[ 6, 7, 8 ] 
Split each sentence into X and Y, and get a list of lengths:
X = sentence[ : -1 ]
Y = sentence[ 1 : ]
X = [[ 1, 2, 3, 4 ], [ 6, 7 ]]
Y = [[ 2, 3, 4, 5 ], [ 7, 8 ]]
X_lengths = [4, 2]

Pad sentences with pad_index (0 in my case):
X = [[ 1, 2, 3, 4 ], [ 6, 7, 0, 0 ]]
Y = [[ 2, 3, 4, 5 ], [ 7, 8, 0, 0 ]]
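To make the sort / split / pad steps concrete, here is a minimal plain-Python sketch of what I do (assuming pad_index = 0, as above):

```python
# Sketch of the sort / split / pad steps, with pad_index = 0.
sentences = [[6, 7, 8], [1, 2, 3, 4, 5]]

# Sort by length, longest first (pack_padded_sequence expects this order).
batch = sorted(sentences, key=len, reverse=True)

# Shift by one token: X is everything but the last token,
# Y is everything but the first.
X = [s[:-1] for s in batch]
Y = [s[1:] for s in batch]
X_lengths = [len(x) for x in X]

# Pad every sequence to the longest length in the batch with pad_index = 0.
max_len = max(X_lengths)
X = [x + [0] * (max_len - len(x)) for x in X]
Y = [y + [0] * (max_len - len(y)) for y in Y]
```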

model.zero_grad()

Get embeddings:
word_embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
X = word_embeddings(X)
X.size() = torch.Size([mini_batch_size, seq_length, embedding_dim])

Transpose:
X = torch.transpose(X, 0, 1)
X.size() = torch.Size([seq_length, mini_batch_size, embedding_dim])

Pack by:
X = torch.nn.utils.rnn.pack_padded_sequence(X, X_lengths)
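The embed / transpose / pack steps above can be sketched as follows (toy sizes are my own choice for illustration: vocab_size = 10, embedding_dim = 5):

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10, 5
X = torch.tensor([[1, 2, 3, 4], [6, 7, 0, 0]])  # padded token indices
X_lengths = [4, 2]

# padding_idx=0 keeps the pad embedding fixed at zero.
word_embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

emb = word_embeddings(X)          # (mini_batch_size, seq_length, embedding_dim)
emb = torch.transpose(emb, 0, 1)  # (seq_length, mini_batch_size, embedding_dim)

# pack_padded_sequence drops the padded positions: packed.data
# holds only the 4 + 2 = 6 real time steps.
packed = nn.utils.rnn.pack_padded_sequence(emb, X_lengths)
```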

Initialise hidden and cell states (zero tensors, not torch.Size objects):
h_t = torch.zeros(layers_dim, mini_batch_size, hidden_dim)
h_c = torch.zeros(layers_dim, mini_batch_size, hidden_dim)
LSTM:
lstm_out, (h_t, h_c) = lstm(X, (h_t, h_c))

Pad by:
lstm_out, _ = torch.nn.utils.rnn.pad_packed_sequence(lstm_out)
lstm_out.size() = torch.Size([seq_length, mini_batch_size, hidden_dim])
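The LSTM / unpack step looks like this in a runnable sketch (toy sizes embedding_dim = 5, hidden_dim = 8, layers_dim = 1 are my own assumptions):

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim, layers_dim = 5, 8, 1
mini_batch_size, seq_length = 2, 4
X_lengths = [4, 2]

lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=layers_dim)

# Stand-in for the packed embeddings from the previous step.
packed = nn.utils.rnn.pack_padded_sequence(
    torch.randn(seq_length, mini_batch_size, embedding_dim), X_lengths)

# Initial hidden and cell states are zero tensors.
h_t = torch.zeros(layers_dim, mini_batch_size, hidden_dim)
h_c = torch.zeros(layers_dim, mini_batch_size, hidden_dim)

lstm_out, (h_t, h_c) = lstm(packed, (h_t, h_c))

# Re-pad so lstm_out is (seq_length, mini_batch_size, hidden_dim) again.
lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out)
```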

Linear layer:
fc = nn.Linear(hidden_dim, vocab_size)
linear_out = fc(lstm_out)
linear_out.size() = torch.Size([seq_length, mini_batch_size, vocab_size])

Softmax (over the vocabulary axis, which is dim=2 for this shape):
Y_hat = F.log_softmax(linear_out, dim=2)
Y_hat.size() = torch.Size([seq_length, mini_batch_size, vocab_size])
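A quick sanity check that dim=2 is the vocabulary axis for this layout (the probabilities at each position should sum to 1):

```python
import torch
import torch.nn.functional as F

seq_length, mini_batch_size, vocab_size = 4, 2, 10
linear_out = torch.randn(seq_length, mini_batch_size, vocab_size)

# log_softmax must run over the vocab axis, i.e. dim=2 for
# a (seq_length, mini_batch_size, vocab_size) tensor.
Y_hat = F.log_softmax(linear_out, dim=2)

# Exponentiating and summing over dim=2 should give ~1 everywhere.
probs = Y_hat.exp().sum(dim=2)
```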
So, the question is:
How do I get the loss, if:
Y_hat.size() = torch.Size([seq_length, mini_batch_size, vocab_size])
Y.size() = torch.Size([mini_batch_size, seq_length])
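Here is the approach I was considering (only a sketch, assuming NLLLoss with ignore_index=0 is the right way to mask the padding): transpose Y to match Y_hat, flatten both, and apply NLLLoss.

```python
import torch
import torch.nn as nn

seq_length, mini_batch_size, vocab_size = 4, 2, 10

# Stand-ins for the tensors from the steps above.
Y_hat = torch.log_softmax(
    torch.randn(seq_length, mini_batch_size, vocab_size), dim=2)
Y = torch.tensor([[2, 3, 4, 5], [7, 8, 0, 0]])  # (mini_batch_size, seq_length)

# Bring Y to (seq_length, mini_batch_size) so it lines up with Y_hat,
# then flatten both to 2-D / 1-D for NLLLoss.
Y_flat = Y.transpose(0, 1).reshape(-1)          # (seq_length * mini_batch_size,)
Y_hat_flat = Y_hat.reshape(-1, vocab_size)      # (seq_length * mini_batch_size, vocab_size)

# ignore_index=0 should make the padded positions contribute nothing.
loss_fn = nn.NLLLoss(ignore_index=0)
loss = loss_fn(Y_hat_flat, Y_flat)
```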
And am I doing the previous steps right?
Thanks in advance!