Why 3d input tensors in LSTM?

Hi there, I’m new to PyTorch (and the community!).

Sorry in advance if this is a silly question, but as I’m getting my feet wet with LSTMs and learning PyTorch at the same time, I’m confused about how nn.LSTM ingests its inputs. From the main PyTorch tutorial and the time sequence prediction example, it looks like the input for an LSTM is a 3-dimensional tensor, but I cannot understand why.

At the end of this thread it is mentioned that the three elements of the input are time dimension (5), feature dimension (3) and mini-batch dimension (100). The suggested 3d tensor is 5 x 100 x 3, which corresponds to time x batch x features.

I’m having trouble understanding this.

If I have a single time series of length 10000, but use mini batches of, say, length 50, and have no feature inputs other than the series itself, would this translate to a ? x 50 x 1 tensor?

What is referred to as “time” in the thread above?

Is the batch size the size of the sliding window of observations that I will use to predict the “next” observed datapoint?

I think related to my misunderstanding is the fact that the time sequence prediction example seems to be predicting multiple sine waves for each point in time, and in this thread @spro says that “any kind of continued sequence prediction counts as many-to-many”, which just adds to my confusion.

I feel my questions all stem from a misunderstanding of something very basic and connected.

Thanks for any help.


The minibatch dimension refers to the number of sequences you want to process in parallel. So if you divide a time series of length 10000 into chunks of length 50, your input tensor would be 50 (timesteps) by 200 (batch size) by 1 (features).
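To make the chunking concrete, here is a minimal sketch of one way to get from a 1-D series to that 50 x 200 x 1 tensor. The sine data and `hidden_size=16` are placeholders I made up for illustration, not anything from the tutorial. Note that calling `series.view(50, 200, 1)` directly would interleave the chunks, so the sketch views as (batch, time, features) first and then swaps the first two dimensions.

```python
import torch
import torch.nn as nn

# Hypothetical data: a single univariate series of length 10000.
series = torch.sin(torch.linspace(0, 100, 10000))

seq_len, n_features = 50, 1
batch_size = series.numel() // seq_len  # 10000 / 50 = 200 sequences

# Row i of `chunks` holds series[i*50 : (i+1)*50], i.e. one contiguous chunk.
chunks = series.view(batch_size, seq_len, n_features)   # (200, 50, 1)

# nn.LSTM expects (seq_len, batch, features) by default, so swap dims 0 and 1.
inputs = chunks.transpose(0, 1).contiguous()            # (50, 200, 1)

lstm = nn.LSTM(input_size=n_features, hidden_size=16)   # hidden_size is arbitrary
output, (h_n, c_n) = lstm(inputs)

print(inputs.shape)   # torch.Size([50, 200, 1])
print(output.shape)   # torch.Size([50, 200, 16])
print(h_n.shape)      # torch.Size([1, 200, 16])
```

Each column `inputs[:, i, 0]` is one chunk of 50 consecutive timesteps, and the 200 columns are processed in parallel.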


That’s very helpful, thanks for taking the time to answer. In terms of performance, are there any best practices for the relationship between timesteps and batch size (high/low, low/high, etc.)?

Large batches with few timesteps -> faster
Small batches with many timesteps -> better generalization/accuracy


Hi @jekbradbury, in the example you gave in the first reply, 50 x 200 x 1, which do you mean:

  1. there are 200 batches, each batch is 50x1
  2. there are 50 batches, each batch is 200x1


There are 200 batches in the dataset; each batch is 50x200x1.

Thanks @jekbradbury. Maybe I’m missing something here.

Let’s say we have some time series data in a 2D dataframe, 10,000 rows, 1 data column, i.e. 10k x 1.

To turn this into batches for torch.nn.LSTM, we make this into 200 batches, so in 2D form that’s 200 dataframes each with dimension 50 x 1.

Since torch.nn.LSTM needs a 3D tensor, we reshape this frame to dimension 50 x 200 x 1 and use this entire 3D tensor as the input for LSTM's forward function.

Is that the correct understanding?

I think I figured this out: the time step here is just the number of layers in the NN. It all makes sense to me now.

Actually the above is wrong, see @chilango’s note below.

I don’t think that’s accurate, timesteps is the number of elements in your input (or output) sequence.

@zhidali Input needs to be a 3d tensor with dimensions (seq_len, batch_size, input_size), so: length of your input sequence, batch size, and number of features (if you only have the time series, there is only 1 feature). If you train with a batch of size 1, the input tensor would be 50x1x1. I’m also learning, but I think it’s accurate :slight_smile:
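A quick shape check of the (seq_len, batch_size, input_size) layout with a batch of size 1, as a sketch only: the random input and `hidden_size=8` are arbitrary choices for illustration. It also shows the `batch_first=True` option of nn.LSTM, which expects (batch_size, seq_len, input_size) instead.

```python
import torch
import torch.nn as nn

seq_len, batch_size, input_size = 50, 1, 1
hidden_size = 8  # arbitrary, chosen only for this sketch

# Default layout: (seq_len, batch_size, input_size) -> 50 x 1 x 1
x = torch.randn(seq_len, batch_size, input_size)
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size)
out, (h_n, c_n) = lstm(x)
print(out.shape)   # torch.Size([50, 1, 8]): one output per timestep
print(h_n.shape)   # torch.Size([1, 1, 8]): final hidden state per sequence

# With batch_first=True the same data is laid out as (batch, seq_len, features).
x_bf = x.transpose(0, 1).contiguous()  # 1 x 50 x 1
lstm_bf = nn.LSTM(input_size=input_size, hidden_size=hidden_size, batch_first=True)
out_bf, _ = lstm_bf(x_bf)
print(out_bf.shape)  # torch.Size([1, 50, 8])
```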


Hi Alex, thanks for the reply. :smile:
Over the last two days I figured out my model: 200 batches, each of size 50x1x1.
I set my batch size to 1, but I can still control the sequence length. I use DataLoader to divide my data into several batches; when I train, I set the sequence length to the same number as the batch size in the DataLoader.
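For anyone reading along, this is roughly how I read that setup; it is a guess at the approach, not the poster's actual code. The DataLoader's `batch_size` is used as the sequence length, and each 50 x 1 chunk gets a batch dimension of 1 inserted before it goes to the LSTM. The `arange` data and `hidden_size=8` are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical series: 10000 observations, 1 feature each -> shape (10000, 1).
series = torch.arange(10000, dtype=torch.float32).unsqueeze(1)

# shuffle=False keeps consecutive observations together in time order.
# Here the DataLoader "batch size" plays the role of the sequence length.
loader = DataLoader(TensorDataset(series), batch_size=50, shuffle=False)
print(len(loader))  # 200 chunks

lstm = nn.LSTM(input_size=1, hidden_size=8)  # hidden_size is arbitrary

(chunk,) = next(iter(loader))   # chunk has shape (50, 1)
x = chunk.unsqueeze(1)          # (seq_len=50, batch=1, features=1)
out, _ = lstm(x)
print(x.shape)    # torch.Size([50, 1, 1])
print(out.shape)  # torch.Size([50, 1, 8])
```

With this layout the LSTM sees one sequence at a time, so nothing is processed in parallel across chunks.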

Hi @chilango, thanks for the comment.

Thinking about it again, I got it wrong. Messed up the definitions of recurrence and layers. Thanks for pointing this out.

Each batch is of size 50 x 1 x 1, right? There are 200 batches of size 50 x 1 x 1, so the entire input tensor is 50 x 200 x 1.
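That relationship can be checked directly. A small sketch, assuming the 200 pieces are consecutive, non-overlapping chunks of the series (the `arange` data is only a stand-in): concatenating the 50 x 1 x 1 pieces along the batch dimension gives the 50 x 200 x 1 tensor.

```python
import torch

# Stand-in series of length 10000.
series = torch.arange(10000, dtype=torch.float32)

# 200 pieces, each shaped (seq_len=50, batch=1, features=1).
pieces = [series[i:i + 50].view(50, 1, 1) for i in range(0, 10000, 50)]
print(len(pieces), pieces[0].shape)  # 200 torch.Size([50, 1, 1])

# Stack them side by side along dim=1, the batch dimension.
full = torch.cat(pieces, dim=1)
print(full.shape)  # torch.Size([50, 200, 1])
```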

I beg to differ: in many tasks, a large batch size means a better representation of the data distribution, which would lead to better convergence.


Just for reference about the batch size problem, here’s a recent OpenAI post/paper about just that. TL;DR You wanna get the gradient noise scale / batch size near 1.
