Hello, I’m starting in NLP and have a simple question.

Let’s say I have an NLP model with a vocabulary of 1000 words and I want to work with sequences of 5 steps. This means that seq_len=5 and input_size=1000. But in many tutorials and articles, all I see is that the input consists of only integers (and not one-hot vectors). So why must the input size be 1000 and not just 1, since there is only one integer per word? Are we dealing with only the index of the vectors? Thanks!

You should look up nn.Embedding. Word embeddings are the de facto method for vectorizing words in most neural network models, particularly RNNs.

An embedding layer is a lookup table of shape (vocab_size, embed_dim). It also makes things easier to use, since it takes word indices as input instead of one-hot vectors. Note that multiplying a one-hot vector by a matrix simply selects the corresponding row anyway, so the index lookup computes the same thing more efficiently.
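Here is a minimal sketch of that equivalence (the sizes 1000 and 16 are just example values):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 16  # example sizes, not prescribed
emb = nn.Embedding(vocab_size, embed_dim)

# One sequence of 5 word indices (seq_len=5) — no one-hot encoding needed
indices = torch.tensor([[4, 7, 0, 999, 42]])
vectors = emb(indices)  # shape: (1, 5, 16)

# The lookup is equivalent to multiplying one-hot vectors by the weight matrix
one_hot = nn.functional.one_hot(indices, vocab_size).float()
same = one_hot @ emb.weight  # shape: (1, 5, 16)
print(torch.allclose(vectors, same))  # True
```

So the "input size" of 1000 only shows up as the number of rows in the embedding table; the model itself just consumes the integer indices.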