Does anyone have some good guides that explain when and how to resize tensors in the forward pass?

I have been trying to apply various RNN models I found online to my dataset, but the error I get most often involves dimensionality (e.g. expected 3 dimensions but got 2). My data consists of sentences (i.e. text data), each of which is assigned a sentiment score as its label. Of the models I have found, one took images as input, which I believe have more dimensions than text, and another used sentences as well but with part-of-speech tags as labels, so instead of each sentence having 1 label, each word in the sentence had 1 label.

I noticed the key to understanding this seems to be in the forward pass function of the model class. There’s often some kind of resizing of the tensors with .view() or .contiguous(), or even something like this: output = output[-1,:,:]

Is there anywhere I can learn more about this in particular? I understand the theory behind RNNs and what’s going on in the rest of the model class (for the most part), but this part I cannot seem to grasp.

Also, while coding it seems like it is often wise to check the .size() or .shape of your tensors at various points in order to figure out more about what the dimensionality errors might be. However, I’m also not sure at what points in the code I should put these checks and how best to use them. If anyone knows of any guides related to doing checks like this, I would appreciate it. Thanks!

Reading through your question, I’m getting the feeling that the error concerns your input dimensions and not your output. By default, all RNN modules in PyTorch expect a 3D input of shape (seq_len, batch, input_size), and I’m guessing that is where the error comes from.

For NLP-related tasks you usually create a dataset of seq_len x batch dimensions, which you then pass through an nn.Embedding (sparse) layer to get the input_size dimension, thus creating the proper input to pass to the RNN layers.

seq_len is the length of a segment of a document (a paragraph, a sentence or even a part of a sentence) which is tokenized, with each token represented by a numeric id. So the sentence “What do you get when you multiply six by seven?” becomes [ 11 24 31 2 7 10 38 30 37 33 34 ]. This is your first dimension.

Say now you have multiple sentences that you could process in parallel. You group them together (perhaps padding them first so that all have the same length) and you get the second dimension, which is the batch. (So now you have a batch, i.e. a group of inputs of equal length.)
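To make the padding and batching step concrete, here is a minimal sketch using pad_sequence from torch.nn.utils.rnn. The token ids and the choice of 0 as the padding id are assumptions made up for illustration, not taken from your data.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token ids for three sentences of different lengths.
sentences = [
    torch.tensor([11, 24, 31, 2, 7]),
    torch.tensor([10, 38, 30]),
    torch.tensor([37, 33, 34, 5]),
]

PAD_ID = 0  # assumed id reserved for padding

# pad_sequence pads every sentence to the longest one; with the default
# batch_first=False the result is laid out as (seq_len, batch).
batch = pad_sequence(sentences, padding_value=PAD_ID)

print(batch.shape)  # torch.Size([5, 3])
```

Each column of batch is one sentence, so batch[:, 1] is the second sentence followed by two padding ids.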

At this point you usually map each word to a vector which represents the “meaning” (loosely speaking) of the word. This could be randomly initialised or one could use an existing vector created via word2vec, glove etc. This vector is your input_size.
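As a rough sketch of how the three dimensions come together, the snippet below pushes a (seq_len, batch) tensor of token ids through nn.Embedding and then an nn.LSTM. All sizes here are made-up values for illustration.

```python
import torch
import torch.nn as nn

seq_len, batch_size = 11, 4      # illustrative sizes
vocab_size, input_size = 50, 8   # illustrative sizes

# Integer token ids with shape (seq_len, batch): only 2 dimensions.
token_ids = torch.randint(0, vocab_size, (seq_len, batch_size))

embedding = nn.Embedding(vocab_size, input_size)
lstm = nn.LSTM(input_size, hidden_size=16)

# The embedding lookup adds the third dimension the RNN expects.
embedded = embedding(token_ids)      # (seq_len, batch, input_size)
output, (h_n, c_n) = lstm(embedded)  # hidden state defaults to zeros

print(token_ids.shape)  # torch.Size([11, 4])
print(embedded.shape)   # torch.Size([11, 4, 8])
print(output.shape)     # torch.Size([11, 4, 16])
```

Feeding token_ids straight into the LSTM, without the embedding, is the kind of mistake that produces a dimensionality error like the one you describe.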

Ah okay, I have done the first two dimensions in this case, my data is in the form of numeric tokens as you mentioned, and each sentence is padded to be the same length. It is the last part with which I am not too familiar. I know of embedding techniques like word2vec and glove. So I understand what you mean when you wrote “you usually map each word to a vector which represents the meaning” but how is this done in the code itself? Is this what it means to initialise the hidden state? Is this hidden state the vector mappings of your words? I have seen code where the hidden state is initialised within the Class defining the model though, so I don’t understand how you map before instantiating the class. Or I could be way off and it is something else?
And what kind of Python data structure would you store this three dimensional data in then? Thanks!

For some more context, the RNN I used started with an embedding layer, then an LSTM, followed by a fully connected layer and finally a SoftMax output. What you wrote makes it seem like my data (which is basically a list of lists, with each document being one list, inside which each sentence is another list with each word a numeric token) is in the correct form to be passed into the model as long as it begins with an embedding layer. Or maybe I am missing something.

This is where you are supposed to use the nn.Embedding layer.

For example:

self.embedding = nn.Embedding(vocabulary_size, embedding_dimensions)

The vocabulary_size is how many words you have in your vocabulary and the embedding_dimensions is the size of the vector that you are using for each word. (In essence this constructs a vocab x dimensions tensor, which the embedding layer then uses to build the final seq_len x batch x dimensions tensor that RNNs need.)

and you use it in the forward pass before the RNN:

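def forward(self, inputs, hidden):
    # inputs: (seq_len, batch) token ids -> embeddings: (seq_len, batch, embedding_dimensions)
    embeddings = self.embedding(inputs)
    output, hidden = self.lstm(embeddings, hidden)
    return output, hidden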
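For the one-label-per-sentence case from the original question, a fuller sketch might look like the following. The class name SentimentLSTM, the layer sizes and the choice to classify from the last time step are all assumptions for illustration; the shape comments show where checks like .shape are most useful, namely right after each layer.

```python
import torch
import torch.nn as nn


class SentimentLSTM(nn.Module):  # hypothetical name
    def __init__(self, vocabulary_size, embedding_dimensions, hidden_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocabulary_size, embedding_dimensions)
        self.lstm = nn.LSTM(embedding_dimensions, hidden_size)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, inputs, hidden=None):
        # inputs must be (seq_len, batch) integer token ids.
        assert inputs.dim() == 2, inputs.shape
        embeddings = self.embedding(inputs)             # (seq_len, batch, emb_dim)
        output, hidden = self.lstm(embeddings, hidden)  # (seq_len, batch, hidden)
        last = output[-1, :, :]                         # (batch, hidden): last time step only
        logits = self.fc(last)                          # (batch, num_classes)
        return logits, hidden


model = SentimentLSTM(vocabulary_size=50, embedding_dimensions=8,
                      hidden_size=16, num_classes=3)
inputs = torch.randint(0, 50, (11, 4))  # 11 tokens per sentence, 4 sentences
logits, hidden = model(inputs)
print(logits.shape)  # torch.Size([4, 3])
```

This is why indexing such as output[-1,:,:] shows up in sentence-level models: it keeps one vector per sentence instead of one per word. With padded batches the last time step may be a padding position, so packing the sequences (torch.nn.utils.rnn.pack_padded_sequence) or indexing by true length is worth considering.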

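You could either initialise the weights of the embedding layer randomly from a distribution, for example with a small scale such as 0.1 (nn.Embedding itself defaults to a standard normal initialisation).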
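A minimal sketch of random initialisation, assuming a uniform distribution over [-0.1, 0.1]; both the distribution and the range are my assumptions here, and the layer sizes are made up.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(100, 8)  # illustrative vocabulary size and dimensions

# Overwrite the default initialisation in place with small uniform values.
nn.init.uniform_(embedding.weight, -0.1, 0.1)

print(embedding.weight.shape)  # torch.Size([100, 8])
```

This is typically done once, in __init__ or right after the model is constructed, before training starts.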

Or, if you want to re-use trained embeddings from some other source (say word2vec), you must pass the vectors to the embedding layer.

You need to construct an id-to-vector mapping during your tokenization phase (I don’t know if there is a different or better way to go about this). Something like this:

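from gensim.models import Word2Vec

embeddings = Word2Vec.load('word2vec.model')

vectors = {}       # token id -> pretrained vector
wordToIndex = {}   # word -> token id
indexToWord = []   # token id -> word

ids = []           # token ids of the current sentence

# sentence: a list of word strings produced by your tokenizer
for word in sentence:
    if word not in wordToIndex:
        # First time we see this word: give it the next free id.
        indexToWord.append(word)
        wordToIndex[word] = len(indexToWord) - 1
    idx = wordToIndex[word]
    ids.append(idx)

    # Trained vectors live on the .wv attribute of a gensim Word2Vec model.
    if word in embeddings.wv:
        vectors[idx] = embeddings.wv[word]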


After processing your entire corpus, the vectors dict contains an idx-to-vector mapping which you can convert to a numpy array (i.e. embedding_weights = np.asarray(list(vectors.values()), dtype=np.float32)) and pass to the embedding layer. Note that the rows only line up with the token ids if every idx has a vector and the values are in idx order.

Obviously, the example above does not handle unknown tokens or words that don’t have a vector in the embeddings you are using, which you should handle.
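One possible way to deal with missing vectors, sketched under assumptions: start from small random rows for every id and overwrite only the rows that have a pretrained vector, then hand the matrix to nn.Embedding.from_pretrained. The pretrained dict below is a stand-in for a real word2vec lookup, and the "<pad>" and "<unk>" entries are hypothetical conventions.

```python
import numpy as np
import torch
import torch.nn as nn

embedding_dimensions = 4

# Stand-in for pretrained vectors (e.g. looked up from a word2vec model).
pretrained = {
    "six": np.ones(embedding_dimensions, dtype=np.float32),
    "seven": np.full(embedding_dimensions, 2.0, dtype=np.float32),
}
# Hypothetical vocabulary; "multiply" has no pretrained vector.
indexToWord = ["<pad>", "<unk>", "six", "seven", "multiply"]

rng = np.random.default_rng(0)
# Every row starts as small random noise, so words without a vector still get one.
embedding_weights = rng.uniform(
    -0.1, 0.1, (len(indexToWord), embedding_dimensions)
).astype(np.float32)
embedding_weights[0] = 0.0  # keep the padding row at zero

# Row idx of the matrix must correspond to token id idx.
for idx, word in enumerate(indexToWord):
    if word in pretrained:
        embedding_weights[idx] = pretrained[word]

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(embedding_weights), freeze=False, padding_idx=0
)
print(embedding.weight.shape)  # torch.Size([5, 4])
```

Building the matrix by id, rather than from the dict's values, keeps the rows aligned with the token ids even when some words have no pretrained vector.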

Final note: you can either let gradients flow into the embedding layer or stop them, thus keeping your embeddings frozen, by setting the requires_grad attribute of the embedding weights.
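A small sketch of freezing, with made-up layer sizes and a toy nn.Sequential model used purely for illustration:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)
# Freeze the embedding: no gradients are computed for its weights.
embedding.weight.requires_grad = False

model = nn.Sequential(embedding, nn.Linear(4, 2))  # toy model for illustration

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

print(len(trainable))  # 2 (the Linear layer's weight and bias)
```

Setting requires_grad back to True later would let you fine-tune the embeddings once the rest of the model has settled, though the optimizer would then need to be given those weights as well.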

Oh! okay I think it is starting to make some sense. I have a dataset and have been trying to run it through various LSTM models I found online to practice. I ran it through one successfully, it had an embedding layer and initialised the weights when the model was instantiated before training (is this always the place where you initialise the embedding layer weights?).

I also found 2 models online from tutorials, and I wondered whether you could tell me if my intuition about why they didn’t work for my data is correct.

Here is the first model:
It is a bidirectional lstm written for image data from MNIST. My data is text of course. I believe, as you just mentioned, it did not work because it did not have an embedding layer in the model, hence the dimension error, correct?

The second link, however, does have an embedding layer, but it does not seem to initialise the embedding layer weights anywhere. I tried to pass it a single sentence from my data as they did in the tutorial, but I still got the dimensionality error (requires 3 dimensions, got 2). What was wrong in this case?

Thanks again for your very detailed and helpful responses!