I am fairly new to NLP and language modeling. I was trying to reproduce the PyTorch word_language_model example with my own code, and I got stuck on generating output after training the RNN. With previous models I generally got output by just using torch.max(), but that did not work for this model, and the only way I could get actual sentences was by copying what is in the generate.py file.

I don't understand why after getting the output we have to use .div().exp().cpu() or what the purpose of torch.multinomial() is. I tried reading the docs but I didn't really see how it applies to this scenario. If anyone could help explain this I would greatly appreciate it, thanks!

I'll start with multinomial, as it is the more straightforward part.

When you have a trained language model, you need a strategy for sampling new sentences. The easiest way is to pick the token with the highest probability at each step; I assume that is what you already did with max(). But this approach has a significant drawback: the decoding is greedy. At every decoding/sampling step, the only possible choice is the top token, so the output is deterministic and nothing else can ever be sampled. To add some variance to the results, we can randomize the choice of the next word.

Imagine at timestep t we have the following predictions:
a: 0.15
the: 0.10
is: 0.09
was: 0.05
The max approach will always select “a”. If we want to give slightly less likely tokens a chance to appear, we can select randomly, but not uniformly, with every token equally likely. We want to sample tokens in proportion to their predicted probabilities. That's why we use the multinomial function. It might be easier to understand if you are familiar with the numpy choice function and its p parameter: https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice

This lets us generate a different sentence every time while still respecting the relative probabilities.
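To make the difference concrete, here is a minimal sketch using the four example probabilities above. The token list and the sample count are only for illustration, and I restrict the vocabulary to those four tokens, so the values are treated as relative weights (torch.multinomial does not need them to sum to one).

```python
import torch

torch.manual_seed(0)  # only to make this sketch repeatable

# Illustrative vocabulary restricted to the four tokens from the example.
tokens = ["a", "the", "is", "was"]
weights = torch.tensor([0.15, 0.10, 0.09, 0.05])

# Greedy decoding: always the single most likely token.
greedy_token = tokens[torch.argmax(weights).item()]
print("greedy:", greedy_token)

# Sampling: draw many times to see how often each token is chosen.
samples = torch.multinomial(weights, num_samples=10_000, replacement=True)
counts = torch.bincount(samples, minlength=len(tokens))
frequencies = counts.float() / counts.sum()

for token, freq in zip(tokens, frequencies.tolist()):
    print(f"{token}: {freq:.3f}")
```

The greedy choice never changes, while the sampled frequencies should come out roughly proportional to the weights, so “a” is still the most common pick but the other tokens show up as well. In an actual generation loop you would draw a single sample per step (num_samples=1).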

(There are also more sophisticated methods, such as beam search.)

The example code also uses the concept of temperature. Basically, it rescales the given probabilities before sampling. If the temperature is low, the sampler is more conservative and sticks to the most likely tokens. If the temperature is high, less likely tokens get a relatively higher probability, so there is a bigger chance they will be selected. In practice, you can roughly expect low temperatures to produce “boring but correct” results and high temperatures to produce “interesting with errors” results.
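As a rough sketch of the arithmetic, assuming the model output is log-probabilities (or logits): dividing by the temperature and exponentiating, which is what .div(temperature).exp() does, gives one unnormalized weight per token that can be passed to torch.multinomial. The helper apply_temperature below is a made-up name for illustration and is written in plain Python; it also normalizes the weights so the effect is easier to read.

```python
import math

def apply_temperature(probs, temperature):
    """Compute exp(log(p) / temperature) per token, then renormalize."""
    weights = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

# The four example probabilities, treated as the whole vocabulary.
tokens = ["a", "the", "is", "was"]
probs = [0.15, 0.10, 0.09, 0.05]

base = apply_temperature(probs, 1.0)  # unchanged apart from normalization
cold = apply_temperature(probs, 0.5)  # low temperature: sharper
hot = apply_temperature(probs, 2.0)   # high temperature: flatter

for name, dist in [("T=0.5", cold), ("T=1.0", base), ("T=2.0", hot)]:
    row = ", ".join(f"{tok}: {p:.3f}" for tok, p in zip(tokens, dist))
    print(f"{name} -> {row}")
```

With a low temperature the top token takes a larger share, and with a high temperature the distribution moves closer to uniform. The .cpu() call in the example only moves the tensor to the CPU and does not affect the probabilities.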