Hi, I am working on a project that is essentially a generalized version of a language model.
Given a sequence of sentences, I want to predict likely future words by understanding the context. For example, if we split a document into two halves, the first half (a sequence of sentences) is fed as input to my network, and the words in the second half (non-stop words) should be predicted. For this problem, I built the network shown below.
Input (Seq. of word embeddings) -> Encoder (BiGRU) -> Dropout -> FC Layer -> Dropout -> Linear (size = Target vocab).
I am using MultiLabelSoftMarginLoss as the loss function, since each input can map to multiple target words.
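To make the setup concrete, here is a minimal PyTorch sketch of the architecture described above. All dimensions (embedding size, hidden size, vocab size) and the class name `ContextPredictor` are placeholders, not my actual values; note that `MultiLabelSoftMarginLoss` applies the sigmoid internally, so the model outputs raw logits.

```python
import torch
import torch.nn as nn

class ContextPredictor(nn.Module):
    """BiGRU encoder -> Dropout -> FC -> Dropout -> Linear over the target vocab."""
    def __init__(self, emb_dim=100, hidden=128, fc_dim=256, vocab=5000, dropout=0.3):
        super().__init__()
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.drop1 = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden, fc_dim)
        self.drop2 = nn.Dropout(dropout)
        self.out = nn.Linear(fc_dim, vocab)

    def forward(self, x):
        # x: (batch, seq_len, emb_dim) pre-computed word embeddings
        _, h = self.encoder(x)               # h: (2, batch, hidden) for 1 BiGRU layer
        h = torch.cat([h[0], h[1]], dim=-1)  # concat forward/backward final states
        z = self.drop2(torch.relu(self.fc(self.drop1(h))))
        return self.out(z)                   # raw logits, shape (batch, vocab)

model = ContextPredictor()
criterion = nn.MultiLabelSoftMarginLoss()

x = torch.randn(4, 20, 100)     # dummy batch: 4 samples, 20 words, 100-d embeddings
target = torch.zeros(4, 5000)
target[:, :10] = 1.0            # multi-hot target: 10 "future words" per sample
loss = criterion(model(x), target)
```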
Dataset: Wikipedia; size: 10K samples (pages).
I tried different batch sizes, optimizers, and learning rates, but had no luck: the network is not learning at all. Am I doing something fundamentally wrong? Please suggest.