RNN Architecture Questions

Since you posted in nlp, I assume you work with text.

A very common application is sentence classification (e.g., sentiment analysis), where each sentence is a sequence of words. Let’s say you have a batch of 3 sentences, each containing 10 words (nn.LSTM and nn.GRU by default require sequences of the same length; you can look up padding and packing).

That means your batch has the shape (batch_size, seq_len), i.e., (3, 10) with the numbers above. Note that each sentence/sequence is a vector of integers reflecting the index of each word in your vocabulary.
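For illustration, a minimal sketch of such a batch; the vocabulary size of 5000 and the random indices are just placeholders for whatever your tokenizer/vocabulary lookup would produce:

```python
import torch

# Toy batch: 3 sentences, each padded/truncated to 10 word indices.
# In practice these indices come from your vocabulary lookup, not from randint.
batch_size, seq_len, vocab_size = 3, 10, 5000
batch = torch.randint(0, vocab_size, (batch_size, seq_len))  # integer (LongTensor) indices

print(batch.shape)  # torch.Size([3, 10]) = (batch_size, seq_len)
```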

The next step is to push the batch through an nn.Embedding layer to map the words (represented by their indices) to word vectors of size, say, 100. The output shape after the embedding layer is then (batch_size, seq_len, embed_dim), i.e., (3, 10, 100) with the numbers above.
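A quick sketch of that step (the vocabulary size and the random indices are again placeholder assumptions):

```python
import torch
import torch.nn as nn

batch = torch.randint(0, 5000, (3, 10))                      # (batch_size, seq_len) word indices
embedding = nn.Embedding(num_embeddings=5000, embedding_dim=100)

embedded = embedding(batch)
print(embedded.shape)  # torch.Size([3, 10, 100]) = (batch_size, seq_len, embed_dim)
```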

This tensor can now serve as input for your nn.LSTM or nn.GRU, which expects an input of shape (batch_size, seq_len, input_size) – note that by default it actually expects (seq_len, batch_size, input_size), so either you transpose your tensor or you define your RNN layer with batch_first=True.
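Both options in a short sketch (the hidden_size of 64 is an arbitrary choice for illustration):

```python
import torch
import torch.nn as nn

embedded = torch.randn(3, 10, 100)               # (batch_size, seq_len, embed_dim)

# Option 1: tell the RNN that the batch dimension comes first
lstm = nn.LSTM(input_size=100, hidden_size=64, batch_first=True)
output, (h_n, c_n) = lstm(embedded)
print(output.shape)                              # torch.Size([3, 10, 64])

# Option 2: keep the default (seq_len, batch_size, input_size) and transpose the input
lstm_default = nn.LSTM(input_size=100, hidden_size=64)
output, (h_n, c_n) = lstm_default(embedded.transpose(0, 1))
print(output.shape)                              # torch.Size([10, 3, 64])
```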

Anyway, embed_dim, i.e., the size of your word vectors, defines input_size – 100 in the example above. Summing up:

  • batch_size is the number of sentences in your batch (e.g., 3)
  • seq_len is the number of items in your sequences, such as words in a sentence (e.g., 10)
  • input_size is the size of the tensor/vector that represents a single(!) item in your sequence, such as a 100-dim word vector for each word in a sentence.
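Putting these three quantities together, here is a minimal sketch of the sentence-classification setup mentioned at the top; the hidden size, the two output classes, and classifying from the last hidden state are assumptions for illustration, not the only way to do it:

```python
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Minimal sketch: embedding -> LSTM -> linear layer on the last hidden state."""
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_size=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, batch):                     # batch: (batch_size, seq_len) word indices
        embedded = self.embedding(batch)          # (batch_size, seq_len, embed_dim)
        output, (h_n, c_n) = self.lstm(embedded)  # h_n: (num_layers, batch_size, hidden_size)
        return self.fc(h_n[-1])                   # logits: (batch_size, num_classes)

model = SentenceClassifier()
logits = model(torch.randint(0, 5000, (3, 10)))   # 3 sentences, 10 words each
print(logits.shape)                               # torch.Size([3, 2])
```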

The shapes of the inputs and outputs are very well defined; see, for example, the documentation for nn.LSTM.
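As a quick check with the toy numbers from above (2 layers and hidden_size=64 are assumptions for illustration):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=64, num_layers=2, batch_first=True)
embedded = torch.randn(3, 10, 100)   # (batch_size, seq_len, input_size)

output, (h_n, c_n) = lstm(embedded)
print(output.shape)  # (batch_size, seq_len, hidden_size)    -> torch.Size([3, 10, 64])
print(h_n.shape)     # (num_layers, batch_size, hidden_size) -> torch.Size([2, 3, 64])
print(c_n.shape)     # (num_layers, batch_size, hidden_size) -> torch.Size([2, 3, 64])
# Note: batch_first only affects the input and output tensors, not h_n and c_n.
```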