I am guessing this is only for the inference part. Is the training speed for the first and second cases still pretty similar?
Remember that teacher forcing can be used only during training, not during inference.
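A minimal sketch of the difference, with a hypothetical toy GRU decoder standing in for your model (decode_step, the vocab/hidden sizes, and the <SOS> index are assumptions, not taken from your code):

```python
import torch
import torch.nn as nn

# Hypothetical toy decoder just to make the contrast concrete;
# your real model will differ.
vocab_size, hidden_size, batch = 10, 16, 2
embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)
out_proj = nn.Linear(hidden_size, vocab_size)

def decode_step(token, hidden):
    # token: (1, batch) -> vocab scores of shape (1, batch, vocab)
    output, hidden = gru(embedding(token), hidden)
    return out_proj(output), hidden

target_seq = torch.randint(0, vocab_size, (5, batch))  # fake ground truth

# Training with teacher forcing: feed the ground-truth token at each
# step, regardless of what the decoder actually predicted.
hidden = torch.zeros(1, batch, hidden_size)
for t in range(target_seq.size(0) - 1):
    logits, hidden = decode_step(target_seq[t].unsqueeze(0), hidden)
    # ...loss against target_seq[t + 1] goes here...

# Inference: no ground truth exists, so each step consumes the token
# the decoder itself produced at the previous step.
token = torch.zeros(1, batch, dtype=torch.long)  # <SOS> assumed to be 0
hidden = torch.zeros(1, batch, hidden_size)
for _ in range(5):
    logits, hidden = decode_step(token, hidden)
    token = logits.argmax(-1)                    # feed prediction back in
```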
What you are doing in the first case for inference looks fine to me. If you don't want to break on the <EOS> token, you can preallocate a container of the required size (max_utter_len) and then fill it with decoded_words[i] = top1.squeeze(-1) and word_distributions[i] = word_distribution, as in the sketch below. This will give a marginal improvement.
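A minimal sketch of that preallocation, assuming a decoder step that returns (batch, vocab) scores; decoder_step, the shapes, and the <SOS> index here are placeholders for your actual model:

```python
import torch

# Hypothetical sizes; substitute your own.
max_utter_len, batch, vocab_size, hidden_size = 20, 2, 10, 16

def decoder_step(token, hidden):
    # Stand-in for your decoder's forward step: (batch,) token ids in,
    # (batch, vocab) scores out.
    return torch.randn(batch, vocab_size), hidden

# Preallocate once instead of appending inside the loop.
decoded_words = torch.zeros(max_utter_len, batch, dtype=torch.long)
word_distributions = torch.zeros(max_utter_len, batch, vocab_size)

token = torch.zeros(batch, dtype=torch.long)   # <SOS> assumed to be 0
hidden = torch.zeros(1, batch, hidden_size)
for i in range(max_utter_len):
    word_distribution, hidden = decoder_step(token, hidden)
    _, top1 = word_distribution.topk(1)        # indices, shape (batch, 1)
    decoded_words[i] = top1.squeeze(-1)        # fill in place, no break on <EOS>
    word_distributions[i] = word_distribution
    token = top1.squeeze(-1)                   # feed prediction forward
```

Since the loop always runs max_utter_len steps, you can mask out positions after <EOS> afterwards if you need clean outputs.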