Sparse softmax over a vocabulary


Suppose my vocabulary size is 10,000. At some point, my model emits scores for 3 words that are, let's say, at vocabulary indices 22, 1576, and 9065 respectively. The variable `scores` has dimension (1, 3). How can I obtain a log_softmax over the full vocabulary that can be used with NLLLoss?

I tried something like the following, but it seems that the gradient is not backpropagating:

import torch
import torch.nn.functional as F

OUTPUT_DIM = 10000  # vocabulary size

word_scores = scores.view(-1)  # flatten the (1, 3) model scores; sparse values must be 1-D
index_to_update = [[0, 22], [0, 1576], [0, 9065]]  # (row, column) pairs in the (1, OUTPUT_DIM) tensor
i = torch.LongTensor(index_to_update)
v = word_scores
word_attn_energy = torch.sparse.FloatTensor(i.t(), v, torch.Size([1, OUTPUT_DIM])).to_dense()
word_attn_energy.requires_grad = True
log_prob = F.log_softmax(word_attn_energy, dim=1)

In short, I am looking for something similar to TensorFlow’s tf.nn.sparse_softmax_cross_entropy_with_logits.
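For reference, here is a minimal sketch of one workaround I have been experimenting with (the tensor values below are placeholders standing in for the real model outputs): instead of going through `torch.sparse.FloatTensor` and setting `requires_grad` on the densified result (which creates a new leaf, detached from the scores), an out-of-place `scatter` writes the three scores into a dense (1, OUTPUT_DIM) tensor and keeps them in the autograd graph. Filling the remaining entries with a large negative value makes them contribute essentially zero probability after the softmax.

```python
import torch
import torch.nn.functional as F

OUTPUT_DIM = 10000  # vocabulary size from the question

# Placeholder for the (1, 3) scores the model emits; in the real model
# these already carry a computation graph.
word_scores = torch.randn(1, 3, requires_grad=True)

# Vocabulary indices of the 3 scored words, shaped like word_scores.
index_to_update = torch.tensor([[22, 1576, 9065]])

# Dense logits: a large negative value for unscored words means they get
# ~zero probability after softmax; scatter (out-of-place) inserts the
# real scores and keeps word_scores in the autograd graph.
full = torch.full((1, OUTPUT_DIM), -1e9)
word_attn_energy = full.scatter(1, index_to_update, word_scores)

log_prob = F.log_softmax(word_attn_energy, dim=1)
target = torch.tensor([1576])          # placeholder gold word index
loss = F.nll_loss(log_prob, target)
loss.backward()
print(word_scores.grad)                # non-None: the gradient flows back
```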