Multiclass classification using LSTM

Hi, I am working on a project that is essentially a generalized version of a language model.
Given a sequence of sentences, I want to predict the words likely to appear later by understanding the context. For example, if we split a given document into two halves, the first half (a sequence of sentences) is fed as input to my network, and the non-stop words in the second half should be predicted. For this problem, I built the network shown below.

Input (Seq. of word embeddings) -> Encoder (BiGRU) -> Dropout -> FC Layer -> Dropout -> Linear (size = Target vocab).
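
To make the pipeline above concrete, here is a minimal sketch of it. The post does not name a framework, so PyTorch is an assumption, as are the class name `FutureWordPredictor`, all layer sizes, the ReLU after the FC layer, and the choice to summarize the sequence with the final forward and backward hidden states of the GRU.

```python
import torch
import torch.nn as nn

class FutureWordPredictor(nn.Module):
    """Hypothetical sketch: BiGRU -> Dropout -> FC -> Dropout -> Linear."""

    def __init__(self, embed_dim=50, hidden_dim=64, fc_dim=128,
                 target_vocab=1000, dropout=0.5):
        super().__init__()
        self.encoder = nn.GRU(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden_dim, fc_dim)
        self.out = nn.Linear(fc_dim, target_vocab)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim)
        _, h_n = self.encoder(embeddings)              # h_n: (2, batch, hidden_dim)
        # Assumption: concatenate the last forward and backward hidden states.
        summary = torch.cat([h_n[0], h_n[1]], dim=1)   # (batch, 2 * hidden_dim)
        hidden = torch.relu(self.fc(self.dropout(summary)))
        return self.out(self.dropout(hidden))          # one logit per target word

model = FutureWordPredictor()
logits = model(torch.randn(4, 12, 50))  # 4 documents, 12 embedded tokens each
```

The output is one unnormalized score per word in the target vocabulary, which is the shape a multilabel loss expects.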

I am using MultiLabelSoftMarginLoss as the loss function.
Dataset: Wikipedia, size: 10K samples (pages)
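
For reference, the target encoding and loss described above can be sketched in plain Python. The stop-word list, the whitespace tokenization, the toy document, and the helper names `multi_hot_target` and `multilabel_soft_margin` are assumptions for illustration, not the actual preprocessing. The loss follows the per-sample formula documented for PyTorch's `MultiLabelSoftMarginLoss`: the mean over classes of the binary cross-entropy on the sigmoid of each logit.

```python
import math

# Tiny illustrative stop-word list (an assumption; a real list is much larger).
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "and", "to", "on"}

def multi_hot_target(sentences, vocab):
    """Mark every target-vocabulary word that occurs in the given sentences."""
    present = set()
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in STOP_WORDS:
                present.add(token)
    return [1.0 if word in present else 0.0 for word in vocab]

def multilabel_soft_margin(logits, targets):
    """Mean over classes of binary cross-entropy on sigmoid(logit).

    Not hardened against overflow for very large logits.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        log_p = -math.log1p(math.exp(-x))      # log(sigmoid(x))
        log_not_p = -math.log1p(math.exp(x))   # log(1 - sigmoid(x))
        total += y * log_p + (1.0 - y) * log_not_p
    return -total / len(logits)

doc = [
    "The cat sat on the mat",
    "The dog barked",
    "The cat chased the dog",
    "The mat is red",
]
mid = len(doc) // 2
first_half, second_half = doc[:mid], doc[mid:]   # input / words to predict
vocab = ["cat", "dog", "mat", "red", "barked", "chased", "sat"]
target = multi_hot_target(second_half, vocab)

# All-zero logits give ln(2) regardless of the targets.
untrained_loss = multilabel_soft_margin([0.0] * len(vocab), target)
# Logits that agree with the targets give a much smaller loss.
confident_logits = [5.0 if t == 1.0 else -5.0 for t in target]
confident_loss = multilabel_soft_margin(confident_logits, target)
```

Because all-zero logits always score ln 2 (about 0.693), that value is a handy baseline when reading a training curve for this loss.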

I tried different batch sizes, optimizers, and learning rates, but no luck.

The problem is that the network is not learning at all. Am I doing something fundamentally incorrect? Please suggest.

Can you try building a language model with the first half of the sentences and then predicting the next half, as outlined in the tutorial below?