Confusion about 3D LSTM input

Hi all,

I want to build a simple LSTM model and am a bit confused about the 3D input dimensions. According to the PyTorch documentation, the 3 dimensions represent (seq_len, batch, input_size). However, I cannot figure out why I need both the sequence length and the batch size here. Say I have a 5-dimensional timeseries, that is, 5 feature dimensions. The timeseries has a total length of, say, 1000, with a batch size of 100. From what I have read in some other threads, my input tensor would in this case be (10, 100, 5) (I think). But does that mean that I feed my entire timeseries in at once in that 3D tensor?

Is there a certain reason why the LSTM would take the whole timeseries at once in a 3D tensor?

What would I be doing if I just want to do the splitting into batches in the train function as usual, i.e., iterate over the dataloader, sample batches of size 100, and feed them into the forward method of the LSTM?

Best wishes and thank you in advance!

If you check the source code, it just runs the LSTM iteratively: process input 1, update the hidden state, process input 2, and so on.
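A minimal sketch of that equivalence: a one-layer `nn.LSTM` applied to a whole sequence gives the same result as stepping an `nn.LSTMCell` through time manually, once the cell is given the same weights (the sizes below are made up for illustration).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size = 4, 2, 5, 3

lstm = nn.LSTM(input_size, hidden_size)   # default layout: (seq_len, batch, input_size)
cell = nn.LSTMCell(input_size, hidden_size)

# Copy the LSTM's layer-0 weights into the cell so both compute the same function.
cell.weight_ih.data = lstm.weight_ih_l0.data
cell.weight_hh.data = lstm.weight_hh_l0.data
cell.bias_ih.data = lstm.bias_ih_l0.data
cell.bias_hh.data = lstm.bias_hh_l0.data

x = torch.randn(seq_len, batch, input_size)

# One call that consumes the whole sequence...
out, (h_n, c_n) = lstm(x)

# ...is the same as iterating over the time dimension yourself.
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)
steps = []
for t in range(seq_len):
    h, c = cell(x[t], (h, c))
    steps.append(h)
manual = torch.stack(steps)

print(torch.allclose(out, manual, atol=1e-6))
```

So the 3D tensor is just a convenience: the loop over `seq_len` still happens inside.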

but what is then the difference between seq_len and batch? Why do we need that distinction?

Because a single sample is a sequence, e.g., a sentence to be translated. However, you don't usually train with a single sample but with several at once (a batch), to smooth gradients and to obtain generalizable normalization statistics.
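To make the distinction concrete, here is a minimal sketch with made-up sizes: 7 time steps per sample, 4 samples per batch, 5 features per time step.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 7, 4, 5, 16
lstm = nn.LSTM(input_size, hidden_size)   # expects (seq_len, batch, input_size)

x = torch.randn(seq_len, batch, input_size)
out, (h_n, c_n) = lstm(x)

print(out.shape)   # (7, 4, 16): one hidden vector per time step, per sample
print(h_n.shape)   # (1, 4, 16): the final hidden state of each of the 4 samples
```

The time axis is iterated over internally; the batch axis is processed in parallel. That is why both dimensions exist.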

To me it still seems that one of these 3 dimensions is redundant. If I use, say, a 5-dimensional timeseries and I want to train on it in a batched fashion, then I split it into chunks. If the total timeseries had 1000 timepoints and my batches have size 100, then I would only require input tensors of (100, 5) as I understand it… but I would feed 10 of these tensors to the LSTM during one epoch.

You may have to google how an RNN works, but basically the input is TIME x BATCH x N_FEATURES.
A sample can be of size 100 or of size 1000. It doesn't mean that you split your timeseries into samples. Each sample is a totally different thing.

You are also forgetting that your samples may have different feature sizes, e.g., a video has an HxW feature space, while a sentence may be a one-hot vector of Q elements.
Also, the length T is not fixed for each sample: you may have a video of 10 seconds and another one of 4.

Practical example:
you encode a word dictionary as a one-hot vector of 100 elements, so your n_features is 100.
Now you have a book, and each sentence has a different length.
Sentence 1 may have 5 words --> 5x100
Sentence 2 --> 9 words --> 9x100
and so on up to sentence 4.
If you get a batch of these sentences you will have Tx[S1…S4]x100, but T is different for each sample.
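A sketch of how such a batch is usually assembled in PyTorch: `pad_sequence` pads the shorter sentences with zeros up to the longest one, giving a single (T_max, batch, n_features) tensor. The sentence lengths and random vectors below are made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

vocab_size = 100  # one-hot vectors of 100 elements, as above

# Four sentences of different (hypothetical) lengths: 5, 9, 7, and 3 words.
sentences = [torch.randn(n, vocab_size) for n in (5, 9, 7, 3)]

# pad_sequence aligns them on the time axis: shape (T_max, batch, vocab_size).
padded = pad_sequence(sentences)
print(padded.shape)  # torch.Size([9, 4, 100])
```

For efficiency you can additionally wrap the padded tensor with `pack_padded_sequence` so the RNN skips the padding steps.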

For a video you can have different lengths as well.
Your n_features can be HxW.
Each video contains TxFPS frames.
If you have 5 videos you would get [TxFPS, 5, HxW] features.

Okay, so maybe my confusion arises because I am looking at a special case. I just have one long timeseries that represents a long trajectory of some object. I don't have any predefined division of that timeseries into subsections, like e.g. the sentences of a long text. So I can choose that division myself, and I want all subsections to be of the same length. And that length is my batch size. I.e., I have a 5D timeseries of 1000 recordings, and instead of plugging it in as a whole I will randomly draw samples of size 100 (= batch size) and plug them into the RNN. In that case, what would be the respective seq_len, batch, input_size?

So let's say that you are using absolute Cartesian coordinates. Your 5D tensor is BATCH, T, X, Y, Z.
From here onwards, let's consider a sample to be one segment of your long timeseries.

In order to feed this into an RNN, you have to consider that the position of each sample at each time t is defined by [Xt, Yt, Zt]; these would be your features. Whether these features are adequate to represent your data in this problem is not my concern.

Let me mention that you could consider the whole trajectory as a single sample. Assuming you do want to split it, in your case the input should be T, BATCH_SIZE, [X, Y, Z].
Here:

T --> arbitrary time resolution chosen by you
N_SAMPLES = Timeseries_length/ T
BATCH_SIZE <= N_SAMPLES 
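A minimal sketch of that recipe, with a hypothetical 1000-step trajectory of [X, Y, Z] coordinates, T = 100, and a batch of 4 segments:

```python
import torch

# Hypothetical trajectory: 1000 time steps of [X, Y, Z] coordinates.
trajectory = torch.randn(1000, 3)

T = 100                                   # arbitrary time resolution chosen by you
n_samples = trajectory.shape[0] // T      # 1000 / 100 = 10 segments
segments = trajectory.reshape(n_samples, T, 3)

# Take a batch of segments and move time to the front: (T, batch, input_size).
batch_size = 4                            # batch_size <= n_samples
batch = segments[:batch_size].permute(1, 0, 2)
print(batch.shape)  # torch.Size([100, 4, 3])

# input_size=3 because each time step is [X, Y, Z]; hidden_size is a free choice.
lstm = torch.nn.LSTM(input_size=3, hidden_size=32)
out, _ = lstm(batch)
print(out.shape)    # torch.Size([100, 4, 32])
```

So seq_len is the segment length T you chose, batch is how many segments you process at once, and input_size is 3.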

LSTM docs
Realize that your input_size is 3 [XYZ] at the time of instantiating the RNN module.

What is the proper T? You have to experiment to discover it. Think of it this way: an RNN is expected to encode the information of T events into the hidden state. The longer your sequence is, the more parameters the hidden state needs to encode that information, and there is a limit on how much information the hidden state can encode efficiently. (Note that translators work well for short sentences but tend to fail more for longer ones.)

Realize that you don't have to specify T to the RNN instance; T can be variable even if you choose to fix it. You do have to specify input_size, which is 3 assuming spatial coordinates, but it could be any size according to the representation you use.
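To illustrate, here is a minimal sketch showing that the same LSTM instance accepts sequences of different lengths; only input_size is fixed at construction time (the sizes are made up):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=8)  # note: T appears nowhere here

short = torch.randn(5, 2, 3)     # 5 time steps, batch of 2, 3 features
long = torch.randn(50, 2, 3)     # 50 time steps, same module, no changes

out_short, _ = lstm(short)
out_long, _ = lstm(long)
print(out_short.shape, out_long.shape)  # (5, 2, 8) and (50, 2, 8)
```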

The hidden state can also have an arbitrary size; it's a hyperparameter, since it is a learned representation of your data.

Last but not least: you don't really need to precompute splits of a fixed length; you can take random crops of variable length in order to make your network more robust.
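A small sketch of that idea, with a hypothetical `random_crop` helper (the length bounds are made up):

```python
import torch

trajectory = torch.randn(1000, 3)  # hypothetical long timeseries of [X, Y, Z]

def random_crop(ts, min_len=50, max_len=200):
    """Take one random contiguous segment of variable length from the series."""
    T = torch.randint(min_len, max_len + 1, (1,)).item()
    start = torch.randint(0, ts.shape[0] - T + 1, (1,)).item()
    return ts[start:start + T]

crop = random_crop(trajectory)
print(crop.shape)  # (T, 3) with T somewhere between 50 and 200
```

Note that all crops within one batch still need the same length (or padding/packing), so in practice you would draw one random T per batch and crop every sample to it.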

Hope it helps!

Yes, thanks a lot for your explanation!