Yes, the input for the encoder is (batch_size, seq_len).
Each sequence in a batch is a list/array of integers reflecting the indices of the tokens in the vocabulary. For example, a batch might look like this:
[
[12, 40, 8, 105, 86, 6],
[35, 105, 86, 35, 40, 6]
]
These represent the two sentences "i like to watch movies ." and "you watch movies you like .". This means your vocabulary provides a mapping like
{6: ".", 8: "to", 12: "i", 35: "you", 40: "like", ...}
There is no need to convert the tokens / token indices into one-hot vectors. This is what the nn.Embedding layer is for. To clarify, this layer does not create one-hot vectors either, but accepts individual token indices as input.
You only need to appreciate that, given a vocabulary size of 5678, token index 40 carries the same information as a one-hot vector of size 5678 with a 1 at index 40. You can also check out this post.