Yes, the sequence length is 5. The 3 comes from the 3 probabilities you get for each token, since there are 3 possible tags in this simplified example. For example, if you replace
tag_scores = F.log_softmax(tag_space, dim=1)
with
tag_scores = F.softmax(tag_space, dim=1)
the 3 probabilities in each of the 5 rows should sum to 1. The values in targets
are 0, 1, or 2, indicating which class/probability is the correct one for each token.
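Here is a small sketch to check this yourself. The random tag_space and the targets values are made up for illustration, and I'm assuming the loss is NLLLoss, which is what log_softmax is usually paired with:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the tagger's raw scores: 5 tokens, 3 possible tags.
tag_space = torch.randn(5, 3)

probs = F.softmax(tag_space, dim=1)          # each row sums to 1
log_probs = F.log_softmax(tag_space, dim=1)  # log of those probabilities

row_sums = probs.sum(dim=1)

# Hypothetical targets: one correct tag index (0, 1, or 2) per token.
targets = torch.tensor([0, 1, 2, 0, 1])

# nll_loss expects log-probabilities, hence log_softmax rather than softmax.
loss = F.nll_loss(log_probs, targets)

# The same loss by hand: take the log-probability of the correct tag
# for each token, negate it, and average over the 5 tokens.
manual_loss = -log_probs[torch.arange(5), targets].mean()
```

So each entry of targets just selects one of the 3 columns in the corresponding row.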
Side note: I’m not sure, but extending this tutorial to mini-batches might be tricky. This tutorial uses a lot of view()
calls that might cause problems.
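To illustrate the kind of problem I mean, here is a rough sketch, not the tutorial's code. The shapes, the hidden2tag layer, and the random targets are assumptions for the example, and it ignores padding for sequences of different lengths:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

seq_len, batch_size, hidden_dim, num_tags = 5, 2, 4, 3

# Stand-in for LSTM output in the default layout (seq_len, batch, hidden_dim).
lstm_out = torch.randn(seq_len, batch_size, hidden_dim)
hidden2tag = nn.Linear(hidden_dim, num_tags)

# A reshape written with batch size 1 in mind merges batch and hidden dims:
# the result is (5, 8), which no longer fits a Linear layer expecting 4 inputs.
flattened = lstm_out.view(seq_len, -1)

# nn.Linear acts on the last dimension, so no reshape is needed here.
tag_space = hidden2tag(lstm_out)               # (5, 2, 3)
tag_scores = F.log_softmax(tag_space, dim=-1)  # normalize over the tag dimension

# Hypothetical targets: one tag index per token per sequence.
targets = torch.randint(0, num_tags, (seq_len, batch_size))

# nll_loss wants (N, C) scores and (N,) targets, so flatten both the same way.
loss = F.nll_loss(tag_scores.view(-1, num_tags), targets.view(-1))
```

The point is that any view() that implicitly assumes a batch size of 1 has to be revisited once a real batch dimension is added.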