I have recently been learning deep learning and was experimenting with the word language model example provided by PyTorch here - https://github.com/pytorch/examples/tree/master/word_language_model
In the generate.py script here - https://github.com/pytorch/examples/blob/master/word_language_model/generate.py#L65, I don’t understand how we get the output word from all the word weights by sampling from a multinomial distribution.
output, hidden = model(input, hidden)
word_weights = output.squeeze().data.div(args.temperature).exp().cpu()
word_idx = torch.multinomial(word_weights, 1)
input.data.fill_(word_idx)
word = corpus.dictionary.idx2word[word_idx]
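For reference, here is how I currently read those lines, as a minimal standalone sketch with a made-up logits tensor in place of the model output and an example temperature value:

```python
import torch

# made-up "logits" for a 5-word vocabulary, standing in for output.squeeze().data
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
temperature = 1.0  # example value for args.temperature

# same transformation as in generate.py: divide by temperature, then exponentiate
word_weights = logits.div(temperature).exp()

# torch.multinomial treats the (unnormalized) weights as relative probabilities
word_idx = torch.multinomial(word_weights, 1)[0]
print(word_idx.item(), word_weights[word_idx].item())
```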
If I think of word_weights as the probabilities of all the words, then I would expect us to pick the word with the highest probability (thinking about softmax here). But I could not understand, logically, what the benefit/reason is behind sampling from a multinomial distribution instead.
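To illustrate the difference I mean, here is a small sketch with a made-up weight vector, comparing the argmax choice I expected with what torch.multinomial does:

```python
import torch

word_weights = torch.tensor([8.0, 4.0, 2.0, 1.0])  # made-up unnormalized weights

# what I expected: always pick the word with the highest weight
greedy_idx = torch.argmax(word_weights)
print(greedy_idx.item())  # always 0

# what generate.py does: sample, so lower-weight words also get picked sometimes
counts = torch.zeros(4)
for _ in range(1000):
    idx = torch.multinomial(word_weights, 1)[0]
    counts[idx.item()] += 1
print(counts / counts.sum())  # roughly proportional to 8:4:2:1, i.e. ~[0.53, 0.27, 0.13, 0.07]
```

Running this, the argmax word is chosen every single time in the greedy case, while the sampled counts only follow the weights on average, which is exactly the behaviour I’m asking about.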
I played around a bit and tried to sample 2 word indices and print their corresponding word_weights, and noticed that we don’t necessarily take the word with the higher weight (due to sampling). But I don’t understand the reason behind it.
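Concretely, my experiment was roughly this (again with a made-up weight vector rather than the real model output):

```python
import torch

word_weights = torch.tensor([0.1, 5.0, 2.5, 0.7])  # made-up weights

# draw 2 word indices (without replacement) and look at their weights
word_idx = torch.multinomial(word_weights, 2)
print(word_idx)                # e.g. tensor([2, 1]) - not always the two highest weights
print(word_weights[word_idx])  # the weights of the sampled indices
```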
I understand that this is not necessarily a PyTorch question, but I would appreciate it if someone could share the details behind the sampling.