RNN Architecture Questions

I’m learning about RNNs from the Udacity PyTorch class, and I think I understand the basic concept. Now I’m trying to familiarize myself with the details, starting with understanding the dimensions of inputs and outputs of different layers in an RNN.

  • input_size – The number of expected features in the input x.
  • hidden_size – The number of features in the hidden state h.
  • num_layers – Number of recurrent layers.

I copied these parameter descriptions from the docs, but I’m having a hard time visualizing my network based on these parameters.

  1. What do the docs mean by “features in the input x”? Does it refer to having multiple variables? Or like an image might have multiple channels?
  2. What are “features in the hidden state”? Again, the word “features” seems pretty ambiguous, and I don’t know what it means at all.

I get that the hidden state has some “memory” of its previous self and is updated by combining the current input to the hidden layer with the previous hidden state. I don’t get the specifics of what’s going on though. For example, here is a question from the class, which I have no idea how to answer:

Say you’ve defined a GRU layer with input_size=100, hidden_size=20, and num_layers=1. What will the dimensions of the hidden state be if you’re passing in data, batch first, in batches of 3 sequences at a time?

Unfortunately, RNNs are not explained as well as CNNs were in the previous lesson. Thanks for your help!

Very crudely speaking, the number of features refers to the size of the vectors/tensors.

Outside end-to-end neural networks, the term feature had a more tangible meaning, since feature engineering was an important processing step you had to do “manually”. For example, to use an SVM to classify a text document, you had to extract a set of features from the document. These could be very naive features such as #words or #characters. In this case, each document would be represented by 2 numerical features. In practice, you would have more meaningful features, but they would have a clear semantic meaning.
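To make this tangible, here is a toy sketch of such manual feature extraction; the documents and the two naive features are made up purely for illustration:

```python
# Toy example of manual feature engineering (pre-neural style):
# each document becomes a fixed-size vector of hand-crafted features.
documents = ["the cat sat on the mat", "kittens are small cats"]  # hypothetical data

def extract_features(doc):
    # Two naive features: number of words and number of characters.
    return [len(doc.split()), len(doc)]

feature_vectors = [extract_features(d) for d in documents]
print(feature_vectors)  # [[6, 22], [4, 22]] -- each document is now 2 numerical features
```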

With end-to-end neural networks, this semantic meaning is usually latent and not obvious. For example, each word in a text may be represented by a 300-dim vector of numerical values, i.e., the word has 300 features. This vector places a word in a 300-dim space in relation to other words. However, you usually don’t really know what an individual value in the vector means. For example, the 25th entry in the 300-dim vector does not tell you whether the word is a noun, verb, adjective, etc.

So in case you use an LSTM for text processing, the LSTM processes a sequence of words, and each word is represented by a vector of size input_size. This representation is needed since words are symbolic. For example, the words “cat” and “kitten” are only similar to you because your mental models of both concepts (animal, 4 legs, furry, meows, etc.) are similar. For a computer, these are completely different things. A vector representation now allows you to map “cat” and “kitten” (and all other words) to a numerical representation where the vector for “cat” and the vector for “kitten” are closer together compared to, say, the vector for “cat” and the vector for “train”.
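To give a rough idea of what “closer together” means, here is a minimal sketch using cosine similarity. The 4-dim vectors below are made up; real word vectors (e.g., 300-dim) come from training:

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-dim "word vectors" (real ones would be learned, e.g., 300-dim).
cat    = torch.tensor([0.90, 0.80, 0.10, 0.00])
kitten = torch.tensor([0.85, 0.75, 0.20, 0.05])
train  = torch.tensor([0.00, 0.10, 0.90, 0.80])

# Cosine similarity: close to 1 for similar words, lower for unrelated ones.
print(F.cosine_similarity(cat, kitten, dim=0))  # high, ~0.99
print(F.cosine_similarity(cat, train, dim=0))   # low,  ~0.12
```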

In contrast, if you use an LSTM for time series prediction of already scalar numerical values, input_size is just 1.
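For example, a minimal sketch of such a univariate setup; the hidden size, batch size, and sequence length here are arbitrary:

```python
import torch
import torch.nn as nn

# Univariate time series: each time step is a single scalar, so input_size=1.
lstm = nn.LSTM(input_size=1, hidden_size=20, num_layers=1, batch_first=True)

batch = torch.randn(3, 50, 1)  # 3 series, 50 time steps each, 1 feature per step
output, (h_n, c_n) = lstm(batch)
print(output.shape)  # torch.Size([3, 50, 20])
```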

For the resulting dimensions of the hidden state, you are best off consulting the PyTorch docs, but they definitely do not depend on input_size.
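That said, here is a sketch that plugs the numbers from your quoted class question into nn.GRU, so you can verify the shape against the docs; the sequence length of 10 is chosen arbitrarily:

```python
import torch
import torch.nn as nn

# Numbers from the class question: input_size=100, hidden_size=20, num_layers=1.
gru = nn.GRU(input_size=100, hidden_size=20, num_layers=1, batch_first=True)

x = torch.randn(3, 10, 100)  # batch of 3 sequences, 10 steps each, 100 features per step
output, h_n = gru(x)
# Note: h_n is NOT batch-first, even with batch_first=True.
print(h_n.shape)  # torch.Size([1, 3, 20]) -- (num_layers, batch_size, hidden_size)
```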


I’m afraid I don’t quite understand your explanation. I’m sorry!

I’ve scoured the PyTorch RNN docs, and there is little to no explanation of how data should be formatted for input, and certainly no information about how the shape of the data changes as it moves through the network.

For example, I THINK that for my forward pass I need to have my data in the shape (batch_size, seq_length, input_size), but there are no concrete examples of what seq_length or input_size is. How are they different?

Since you posted in nlp, I assume you work with text.

A very common application is sentence classification (e.g., sentiment classification), where each sentence is a sequence of words. Let’s say you have a batch of 3 sentences, each containing 10 words (nn.LSTM and nn.GRU by default require sequences of the same length; you can look up padding and packing).

That means your batch has the shape (batch_size, seq_len), i.e., (3, 10) with the numbers above. Note that each sentence/sequence is a vector of integers reflecting the index of each word in your vocabulary.

The next step is to push the batch through a nn.Embedding layer to map words (represented by their indices) to word vectors of size, say, 100. The output shape after the embedding layer is then (batch_size, seq_len, embed_dim), i.e., (3, 10, 100) with the numbers above.

This tensor can now serve as input for your nn.LSTM or nn.GRU, which expects an input of shape (batch_size, seq_len, input_size) – note that by default, they actually expect (seq_len, batch_size, input_size); so either you transpose your tensor or you define your RNN layer with batch_first=True.

Anyway, embed_dim, i.e., the size of your word vectors, defines input_size – 100 in the example above. Summing up (see the sketch after this list):

  • batch_size is the number of sentences in your batch (e.g., 3)
  • seq_len is the number of items in your sequences such as words in a sentence (e.g., 10)
  • input_size is the size of the tensor/vector that represents a single(!) item in your sequence such as 100-dim word vectors for each word in a sentence.
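Putting the whole pipeline together, a minimal sketch with the numbers above; the vocabulary size and hidden size are made up:

```python
import torch
import torch.nn as nn

vocab_size  = 1000  # hypothetical vocabulary size
embed_dim   = 100   # size of the word vectors -> input_size of the LSTM
hidden_size = 20

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_size, batch_first=True)

# Batch of 3 sentences, 10 word indices each: shape (batch_size, seq_len) = (3, 10).
batch = torch.randint(0, vocab_size, (3, 10))

embedded = embedding(batch)          # (batch_size, seq_len, embed_dim)
output, (h_n, c_n) = lstm(embedded)
print(embedded.shape)  # torch.Size([3, 10, 100])
print(output.shape)    # torch.Size([3, 10, 20])  -- hidden state for every time step
print(h_n.shape)       # torch.Size([1, 3, 20])   -- last hidden state, (num_layers, batch, hidden)
```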

The shapes of inputs and outputs are very well defined; see, for example, the docs for nn.LSTM.