Confusion about the dimensions of a Seq2Seq model

Yes, the input for the encoder is (batch_size, seq_len).

Each sequence in a batch is a list/array of integers reflecting the indices of the tokens in the vocabulary. For example, a batch might look like this:

[
    [12, 40, 8, 105, 86, 6],
    [35, 105, 86, 35, 40, 6]
]

This represents the two sentences “i like to watch movies .” and “you watch movies you like .”. It means your vocabulary provides a mapping like

{6: ".", 8: "to", 12: "i", 35: "you", 40: "like", ...}
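In practice you usually also keep the inverse mapping (token to index) to encode sentences. A minimal sketch, assuming the example vocabulary above (the dictionary names are just illustrative):

```python
# Example index-to-token mapping (as above) and its inverse
idx2token = {6: ".", 8: "to", 12: "i", 35: "you", 40: "like", 86: "movies", 105: "watch"}
token2idx = {token: idx for idx, token in idx2token.items()}

sentence = "i like to watch movies .".split()
indices = [token2idx[token] for token in sentence]
print(indices)  # [12, 40, 8, 105, 86, 6]
```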

There is no need to convert the tokens / token indices into one-hot vectors; this is what the nn.Embedding layer is for. To be clear, this layer does not create one-hot vectors either, but accepts the individual token indices directly as input.
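Here is a minimal sketch of that (the vocabulary size and embedding dimension are just example values): the embedding layer takes the (batch_size, seq_len) tensor of indices and returns a (batch_size, seq_len, embedding_dim) tensor, which is what the encoder then consumes.

```python
import torch
import torch.nn as nn

vocab_size = 5678      # example vocabulary size
embedding_dim = 128    # example embedding dimension

embedding = nn.Embedding(vocab_size, embedding_dim)

# Batch of 2 sequences of length 6, containing token indices (not one-hot vectors)
batch = torch.tensor([
    [12, 40, 8, 105, 86, 6],
    [35, 105, 86, 35, 40, 6],
])                              # shape: (batch_size=2, seq_len=6)

embedded = embedding(batch)     # shape: (2, 6, 128)
print(embedded.shape)           # torch.Size([2, 6, 128])
```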

You only need to appreciate that, say, token index 40 with a vocabulary size of 5678 carries the same information as a one-hot vector of size 5678 with a 1 at index 40. You can also check out this post.
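To make that equivalence concrete, here is a small check (the sizes are again just example values): multiplying the one-hot vector with the embedding weight matrix picks out exactly the same row as the plain index lookup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 5678, 128          # example sizes
embedding = nn.Embedding(vocab_size, embedding_dim)

# One-hot vector of size 5678 with a 1 at index 40
one_hot = F.one_hot(torch.tensor(40), num_classes=vocab_size).float()

# Multiplying the one-hot vector with the embedding weights ...
via_one_hot = one_hot @ embedding.weight       # shape: (128,)
# ... selects the same row as the plain index lookup
via_index = embedding(torch.tensor(40))        # shape: (128,)

print(torch.allclose(via_one_hot, via_index))  # True
```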