Weighted input for LSTM predicting next word in word2vec model (sequence generator)

I have a sequence generator trained to predict the next word vector from a word2vec model. The problem is that some words appear much more frequently than others, and the model gets stuck in a loop of predicting only those words. I’ve read that some loss functions accept a weight parameter so that over-represented classes can be down-weighted and the model doesn’t predict them as often, but my loss function of choice is CosineEmbeddingLoss, which doesn’t have a weight parameter. The input and target of the LSTM are sequences of word2vec vectors. How can I weight this input appropriately so as to achieve a more balanced prediction?
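For context, a minimal sketch of the setup described, assuming PyTorch; the layer sizes and names are illustrative, not taken from my actual code:

```python
import torch
import torch.nn as nn

embed_dim = 300                      # word2vec dimensionality (assumed)
lstm = nn.LSTM(embed_dim, 512, batch_first=True)
proj = nn.Linear(512, embed_dim)     # project hidden state back into vector space
loss_fn = nn.CosineEmbeddingLoss()   # compares two vectors; no weight parameter

def step(batch_vectors, target_vectors):
    # batch_vectors: (batch, seq_len, embed_dim); target_vectors: (batch, embed_dim)
    out, _ = lstm(batch_vectors)
    pred = proj(out[:, -1, :])                 # predicted next-word vector
    ones = torch.ones(pred.size(0))            # label every pair as "should be similar"
    return loss_fn(pred, target_vectors, ones)
```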

A better way to handle class imbalance would be to pre-process the input data and post-process the results.
E.g. if there are stopwords, which you don’t necessarily want in the results, get rid of them, as in the sketch below.
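A rough sketch of that idea, assuming gensim's built-in stopword list (any stopword set would do):

```python
from gensim.parsing.preprocessing import STOPWORDS

def preprocess(tokens):
    # Drop stopwords from the training sequences so the model never learns to emit them.
    return [t for t in tokens if t not in STOPWORDS]

def postprocess(generated_tokens):
    # Filter them again from whatever the generator produces, just in case.
    return [t for t in generated_tokens if t not in STOPWORDS]
```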
I’ve used temperature scaling, softmax_i = e^(z_i/T) / sum_j e^(z_j/T), to bring more variation into the output. I'm not sure whether you can combine it with CosineEmbeddingLoss, though.
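Since your model outputs a vector rather than logits, one way to apply it at generation time (not in the loss) is to treat the cosine similarities to the vocabulary vectors as the logits z and sample from the temperature-scaled softmax. A sketch, where vocab_vecs and vocab_words are assumed to come from your word2vec model:

```python
import numpy as np

def sample_word(pred_vec, vocab_vecs, vocab_words, T=0.8):
    # Normalize so the dot product equals cosine similarity.
    pred = pred_vec / np.linalg.norm(pred_vec)
    vocab = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    z = vocab @ pred                        # cosine similarity to every vocab word
    p = np.exp(z / T - np.max(z / T))       # softmax_i = e^(z_i/T) / sum_j e^(z_j/T)
    p /= p.sum()
    # Lower T -> greedier (closer to argmax); higher T -> more varied output.
    return np.random.choice(vocab_words, p=p)
```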