Time series LSTM: Size mismatch beginner question

Beginner here so please bear with me. I’m adapting this LSTM tutorial to predict a time series instead of handwritten numbers.

In the original problem (using MNIST) there are 60000 28 * 28 images that are used to train the network. These get reshaped into a 28 * 60000 * 28 tensor to be ingested by the model.

My original data is a one-dimensional time series with shape (40000,). With a batch size of 20 I reshape it to a (5, 8000, 1) tensor corresponding to (timesteps, batches, features).

I’m trying to build an LSTM that takes 5 timesteps and predicts the “next” one, using a hidden layer of dimension 128 (5 --> 128 --> 1), but I’m getting a mismatch when I run the code. I can solve the problem but I don’t quite get what is going on.

Here’s my code and mock data.

I’m getting the following error:

RuntimeError: size mismatch, m1: [20 x 1], m2: [5 x 512]

  • 20 is the batch size I defined
  • 1 is the sequence length (the number of features; only the time series itself at this point)
  • 5 is the number of timesteps
  • 512 is 128 * 4, but I'm not sure where this comes from (why four times the dimension of the hidden layer?)

So obviously if I change the sequence length to 5 it works, but I’m confused because then I would have an input tensor with shape (5, 1600, 5) and not the desired (5, 8000, 1).

The new shape doesn’t seem right because I want to take 5 data points to predict the 6th, so it should be a 5 x 1 vector that maps to a scalar, not a 5 x 5 grid that maps to a scalar (like the 28 x 28 grid in the original MNIST code).

What am I not getting?

Thanks for any insight.


I’m not going to look in detail at your problem, but I do have the following superficial reactions:

  • First, for the error message it’s good to include the full output, with the whole stack trace, line numbers and so on. Since this can be quite long, https://gist.github.com is quite good for this.
  • I notice you are suggesting that ‘sequence length’ and ‘number of features’ are conceptually the same. Without looking at your own code, generally speaking (a short shape-check sketch follows this list):
    – sequence length tends to correspond to the number of time steps you’re going to forward/back propagate over. For example, if you are using a char-level rnn, to predict the next character, and your input data is ‘welcom’, and the label is ‘e’, the sequence length here is 6: the 6 letters of ‘welcom’
    – number of features corresponds to the number of dimensions of the input at each time step. if you’re feeding in one-hot characters, this is the number of possible characters, typically something of the order of 60-100 or so
    – but the RNN itself has hidden layers, with possibly a different number of features
    – and the output can have yet another number of features
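To make those conventions concrete, here is a minimal shape-check sketch. The numbers and names are illustrative (loosely following the char-rnn example above), not taken from your code:

import torch
import torch.nn as nn
from torch.autograd import Variable

seq_len = 6        # e.g. the 6 letters of 'welcom'
batch_size = 3     # how many sequences are processed together
input_size = 80    # e.g. size of a one-hot character vocabulary
hidden_size = 128  # number of features in the LSTM's hidden state

lstm = nn.LSTM(input_size, hidden_size)
x = Variable(torch.rand(seq_len, batch_size, input_size))

out, (h_n, c_n) = lstm(x)
print(out.size())  # (seq_len, batch_size, hidden_size): one hidden vector per timestep
print(h_n.size())  # (num_layers, batch_size, hidden_size): final hidden state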

I’ll post errors appropriately for future posts, thanks.

Thanks for your input, very helpful. Let me recap what I understand so far to see if it’s clear.

LSTM expects by default its input as a tensor of form (seq_length, batch, input_size).

seq_length is the size of the window of timesteps I want to use to predict the next timestep, right? For example if I want to use 5 previous observations to predict the next one it would be seq_length = 5.

batch is the number of samples.

input_dim is the number of features at each timestep (so at each step of seq_length). If I only have the time series this would be 1, but if I add e.g. a one-hot hour of the day it would be a 1 + 24 = 25 length vector for each timestep.
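To check my understanding with concrete shapes (mock data only, not my actual preprocessing):

import torch

seq_length = 5       # window of previous timesteps
batch_size = 20
input_size = 1 + 24  # raw value plus a one-hot hour of the day = 25 features per timestep

# One batch in the (seq_length, batch, input_size) layout the LSTM expects by default
x = torch.rand(seq_length, batch_size, input_size)
print(x.size())  # torch.Size([5, 20, 25])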

Two questions:

  1. Is this accurate?
  2. If it is: when the input is a matrix (say 5 steps x 25 features) does this affect the way I instantiate nn.LSTM? For example nn.LSTM((5,25), hidden_dim)?

As far as number of features, generally speaking, if your feature is a 1-of-k thing, like say a letter, or a word in a vocabulary, you’d use one-hot encoding. This means number of features = number of classes. I’m not sure I understand what you mean by ‘the timeseries only has one feature’.

I mean that there are no additional features to predict the time series other than the series itself (so no metadata of sorts, like hour of the day, day of week, etc.). As I’m trying to understand LSTMs, it makes sense to me to first try a simple prediction exercise using only the series itself (predict t+1 from previous timesteps) and then improve accuracy by adding metadata to each timestep (predict t+1 from previous timesteps with metadata for each timestep).
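For example, this is roughly how I imagine slicing the raw series into 5-step windows with the next value as the target (a rough sketch with random data standing in for my series, not my exact preprocessing):

import torch

series = torch.rand(40000)  # stand-in for the 1-D time series
seq_length = 5

# Each sample: 5 consecutive observations as input, the 6th as the target
windows = series.unfold(0, seq_length + 1, 1)  # (n_samples, 6)
inputs = windows[:, :seq_length]               # (n_samples, 5)
targets = windows[:, seq_length]               # (n_samples,)

# Rearranged into the (seq_length, n_samples, n_features) layout
inputs = inputs.t().unsqueeze(-1)              # (5, n_samples, 1)
print(inputs.size(), targets.size())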

Oh I see. You’re trying to predict if the next pixel is 1 or 0? Therefore, just one feature, which can be 1 or 0?

I have a time series of stock prices and I want to predict the stock price at time t+1 using previous observations. To begin I want to predict only using the previous observations, and once I finally have a working model improve model accuracy by adding features to each timestep (for example using other related stock time series as features). Does this make sense?

Ah. It is a real-valued feature. Interesting. In that case, yes, the number of features is 1, and if you have 5 previous observations, the seq_len is I believe 5.

Thanks for your time, at least that is now clear in my mind :slight_smile:

EDIT: But still getting the same mismatch error.

Here’s the traceback of my error. The actual error reads RuntimeError: size mismatch, m1: [20 x 1], m2: [5 x 512].

I’ve been able to partially track down the source of the error to the following:

The mismatch comes from the dot product

(batch_size x input_size) * (seq_length x 4*hidden_dim)

which is the matrix product of the data batch with the LSTM’s input-to-hidden weights in the forward step. (I think 4*hidden_dim corresponds to the weight matrices for each one of the gates i, f, o, and g stacked together.)
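As a quick sanity check on where the 512 and the 5 might come from, here is a standalone snippet assuming a plain nn.LSTM built with the same sizes I’m using:

import torch.nn as nn

# hidden_dim = 128, and the LSTM was built with 5 as its first argument
lstm = nn.LSTM(5, 128)
print(lstm.weight_ih_l0.size())  # torch.Size([512, 5])   -> (4*hidden_dim, input_size)
print(lstm.weight_hh_l0.size())  # torch.Size([512, 128]) -> (4*hidden_dim, hidden_dim)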

As far as I can tell, I’m passing a properly shaped tensor to LSTM (seq_length, num_samples, input_size) or (5, 20, 1) and c0/h0 seem to be ok also like so:

h0 = Variable(torch.zeros([1, 20, 128]), requires_grad=False)

Help? I’ve been thinking about this way too long I need a drink. Peace.

Unless anyone else answers, perhaps you can write a short 5-10 line example that reproduces the error you are seeing, using torch.rand(…) in place of actual data? Try to make the example as short as you can. Using names for the various constants, like seq_len, batch_size etc., will be a plus.

Here it is.

import torch
import torch.nn as nn
from torch.autograd import Variable

class LSTMNet(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LSTMNet, self).__init__()
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(input_dim, hidden_dim)
        self.linear = nn.Linear(hidden_dim, output_dim, bias=False)

    def forward(self, x):
        batch_size = x.size()[1]
        # Initial hidden and cell states: (num_layers, batch_size, hidden_dim)
        h0 = Variable(torch.zeros([1, batch_size, self.hidden_dim]), requires_grad=False)
        c0 = Variable(torch.zeros([1, batch_size, self.hidden_dim]), requires_grad=False)
        fx, _ = self.lstm.forward(x, (h0, c0))
        # Use only the output at the last timestep for the prediction
        return self.linear.forward(fx[-1])

seq_length = 5   # Number of timesteps for prediction.
input_dim = 1    # Number of features
hidden_dim = 128 
batch_size = 20
output_dim = 1   # Predict a real-valued feature

x = Variable(torch.rand(seq_length, batch_size, input_dim), requires_grad=False)
model = LSTMNet(seq_length, hidden_dim, output_dim)
model.forward(x)

If I change input_dim=5 it obviously works. It’s almost as if it expects an n x n input by default.

Thanks for your time, again.

I stopped reading when I noticed an inconsistency between the name of the first parameter of your LSTMNet constructor, input_dim, and the value you initialize it with, which is seq_length. I think it would be good to fix such inconsistencies and keep everything tidy.

That might be part of my confusion. From what I understand from our convo above and another thread I started, seq_length would be the size of the window I want to use to predict, in my case 5. If I use input_dim=1 in the LSTMNet constructor I will get a model that takes a sequence of length 1 --> 128 --> 1, which would take the last observation (a window of size 1) to predict the next step, and not 5 like I want.

Unless I have it backwards and seq_length is the number of features (1) and input_dim the window size (5)?

seq_length is the window size, 5 in your case. input_dim is the number of features, 1 in your case.
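So, reusing your LSTMNet class (and imports) from the snippet above, I’d expect something along these lines to go through without the mismatch (untested sketch):

seq_length = 5   # window size: 5 previous observations
input_dim = 1    # one real-valued feature per timestep
hidden_dim = 128
batch_size = 20
output_dim = 1

x = Variable(torch.rand(seq_length, batch_size, input_dim), requires_grad=False)
model = LSTMNet(input_dim, hidden_dim, output_dim)  # input_dim here, not seq_length
out = model.forward(x)
print(out.size())  # (batch_size, output_dim): one prediction per sample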

So input_dim is a single feature of length seq_length? Is that the way to interpret it? That’s why the properly constructed model would show:

LSTMNet (
  (lstm): LSTM(1, 128)
  (linear): Linear (128 -> 1)
)

So the 1 in (lstm): LSTM(1, 128) is not the size of the window (5) but instead the number of features? Features which have length of seq_length (5)?

Not sure if this is useful or not? Anyway, it’s not entirely unrelated :slight_smile: https://www.youtube.com/watch?v=6WdLgUEcJMI&feature=youtu.be (“Create pytorch rnn functor, pass random input through it”).


Carrying on from this, training to memorize a sequence of integers, and handling a bunch of embedding/dimension mismatch issues along the way: https://www.youtube.com/watch?v=MKA6v99uYKY&feature=youtu.be . It was 1am when I recorded this. Not sure how noticeable that is :smile:

Hi, I am happy to see your video on YouTube. If you don’t mind, could you please share your code for “Train pytorch rnn to predict a sequence of integers”? Thank you very much.