By the looks of it, are you using nn.LSTM? You would want to use nn.LSTMCell for this.
Just out of curiosity, what is the difference, both in behavior and in use case, between LSTM and LSTMCell?
I think LSTMCell = do your own timestep, LSTM = pass the entire sequence. Use case: use LSTM unless you have a reason not to. Reasons are things that LSTM/cuDNN don’t support, e.g. attention, or deciding at each step whether to teacher-force. OpenNMT-py uses both across its encoder and decoder.
An LSTMCell is a single step of an LSTM, so you unroll it over time yourself. What you want here is a stateful LSTM, which is much easier to build with an LSTMCell.
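To make the difference concrete, here is a minimal sketch (not from the thread; the tensor sizes and variable names are arbitrary assumptions) showing that stepping an nn.LSTMCell by hand reproduces what a single-layer nn.LSTM does internally once both share the same weights:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary sizes chosen for illustration only.
seq_len, batch, n_in, n_hid = 5, 3, 2, 4
x = torch.randn(seq_len, batch, n_in)   # (time, batch, features)

lstm = nn.LSTM(n_in, n_hid)             # consumes the whole sequence
cell = nn.LSTMCell(n_in, n_hid)         # consumes one timestep

# Copy the weights so both modules compute the same function.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

# nn.LSTM: one call, the time loop runs inside the module.
out_seq, (h_n, c_n) = lstm(x)

# nn.LSTMCell: we write the time loop ourselves.
h = torch.zeros(batch, n_hid)
c = torch.zeros(batch, n_hid)
outs = []
for t in range(seq_len):
    h, c = cell(x[t], (h, c))
    outs.append(h)
out_loop = torch.stack(outs)

same = torch.allclose(out_seq, out_loop, atol=1e-5)
```

The manual loop is where you would insert anything the fused module cannot do for you, such as attention over previous states or feeding the model's own prediction back in as the next input.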
So, I had a lot of trouble with this issue a few months ago for my PhD.
To be exact, what I have is lots of x, y points of pen trajectories (continuous data) drawing letters. I have the whole alphabet recorded for 400 writers. What I wanted was to let the LSTM (a GRU in my case) generate letters. Straightforward? Not at all!
- I trained the model on prediction (predicting the next step), and then used it to generate the letters. I was inspired by the approach Andrej Karpathy took in his blog.
The prediction result was perfect! However, when I generate (note again, this is continuous data; Karpathy uses categorical data, i.e. characters), the output always flattens out (literally saturating at some value very quickly). Whatever the letter, whatever the writer style, the result is always garbage!
- I then tried something very simple: just learn (or, technically, memorize) a continuous sine wave. That took 2 weeks to resolve (just to make the GRU memorize). Refer to the discussion in this thread
LSTM time sequence generation
and this question in stack overflow
The whole trick was to ‘increase’ the sequence length enough during the training, to enable the algorithm to capture the ‘whole’ pattern. Remember, this is just to memorize, not to ‘learn’.
My conclusion was: discretize, discretize, discretize.
- I found this super awesome paper from Alex Graves, where he solves this problem (but on a different dataset). He uses something called a Mixture Density Network. It is really a beautiful thing! He managed to handle continuous data neatly.
I tried to replicate his architecture in PyTorch (well, to be fair, I didn’t try hard enough), but it is very unstable during training. I tried to stabilize it in many ways, but I had to stop pushing in this direction shortly after (you know, supervisors are moody!).
- Following some advice in the thread mentioned in point 2 (make the model see its own output), I tried that (still believing that continuous data would work…). This is similar to what @Kaixhin suggested. Although the idea makes sense, how to actually do it is a big, big issue!
In the continuous domain, nothing happens (literally nothing). The results are still garbage.
To get some idea of how to do this, take a look at this paper, called Scheduled Sampling (but the paper implements it for discrete data, of course!).
Then this awesome guy came along, proving that this scheme of feeding the model its own output can lead to problems! (And if the model at any point in time sees only its own output, it will never learn the correct distribution!)
So the authors of the scheduled sampling paper wrote a new paper, acknowledging this awesome guy’s neat paper, in order to remedy the problem. It is called ‘professor forcing’.
To be honest, I don’t like this paper very much, but I think it is a good step in the right direction (note again, they only use categorical data: finite letters or words).
In short, how to make the model take its own distribution error into account is still an open issue (to the best of my knowledge).
- Having reached a huge amount of failure and frustration at this point, I decided to discretize the data. Instead of x, y, I used another encoding (Freeman codes, no need to go into details).
I used the architecture described by Karpathy in his blog plus the ‘Show and Tell’ technique to bias the model (since I have the same letters, i.e. labels, from different writers).
It worked neatly!! I am super happy with it (even though you lose important information when using Freeman codes, we have actual letters, and many are really complex and beautiful).
- I was feeling confident at this point, so I decided to add ‘distance’ as a second quantity to predict alongside the Freeman code; I discretized it heavily (I forget the details; just imagine I want to predict two different random variables instead of just one). I don’t fuse the modalities. I have two softmaxes at the end, each predicting a different random variable.
That is super tricky! In prediction, it works perfectly. In generation, it is complete garbage!
After some thinking, I came to realize the problem. Assume the LSTM output is h, the Freeman code variable is R1, and the distance variable is R2. When I train, I train the model on P(R1 | h) and P(R2 | h), but not on P(R1, R2 | h). When you sample the way Karpathy does, you use a multinomial distribution for each softmax. This leads the model to follow two different paths for R1 and R2, which produces this garbage (in short, with this scheme you need P(R1, R2 | h)). I am still working on solving this problem at the moment.
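The mismatch between the two factorized heads and the joint distribution can be illustrated with a toy example. This is only a sketch: the two-value vocabularies and the probabilities below are made up for illustration, not taken from the handwriting data.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical joint distribution P(R1, R2 | h) for one fixed h:
# only these two (direction, distance) pairs ever occur together.
joint = {("up", "long"): 0.5, ("right", "short"): 0.5}

# Marginals P(R1 | h) and P(R2 | h), which is what two separate
# softmax heads trained independently would learn.
p_r1, p_r2 = Counter(), Counter()
for (r1, r2), p in joint.items():
    p_r1[r1] += p
    p_r2[r2] += p

def sample(dist):
    """Draw one key from a {key: probability} mapping."""
    keys = list(dist)
    return random.choices(keys, weights=[dist[k] for k in keys])[0]

n = 10_000
# Sampling each head on its own vs. sampling the pair jointly.
independent = Counter((sample(p_r1), sample(p_r2)) for _ in range(n))
from_joint = Counter(sample(joint) for _ in range(n))

# Pairs that have zero probability under the true joint distribution.
invalid = sum(c for pair, c in independent.items() if pair not in joint)
invalid_fraction = invalid / n
```

Roughly half of the independently sampled pairs are combinations that never occur under the joint. One standard way around this, offered here as a suggestion rather than as the poster's solution, is to apply the chain rule and sample R2 from a head conditioned on the sampled R1, i.e. P(R1, R2 | h) = P(R1 | h) P(R2 | R1, h).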
In short, my advice is (if you need something quick):
- Discretize (categorize) your data
- Use Karpathy’s approach (mentioned in his blog)
- If you need to bias the model, there are many techniques to do that. I tried Show and Tell. It is simple, beautiful and works well.
- If you want to predict more than one variable, it is still an issue (let me know if you have some ideas)
If you have time, I suggest you look at ‘Alex Graves’ approach (I would love to see an implementation for this MDN in pytorch).
One last thing (I think you can already guess by now): a good prediction result does not at all mean a good generation result. These are different objectives. Generating sequences is done by somehow ‘tricking’ a system that was trained on prediction.
In short: Good prediction != Good generation
Good prediction + following state-of-the-art practices (mostly, discretize) ~= Good generation, where ~= means ‘hopefully equal’
Great post, thanks for all the references and ideas!
I’ll have to look carefully at them, but just as a first thought: it looks like discretizing a time series would represent the data at each timestep as a one-hot vector (indicating the discrete bin the real value falls in). So instead of a time series I would have a grid of T x R, where T is the number of timesteps and R is the number of discrete bins that partition the range of my time series. Thoughts?
That looks good to me.
I recommend you take a quick look at Karpathy’s blog, at the way he presents the data to the network and how he samples from it. It should clear up any ambiguities you have about this issue (it is a fun and interesting read).
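The T x R grid described in the question could be built roughly like this. It is a pure-Python sketch that assumes equal-width bins between the series minimum and maximum; `discretize` and `one_hot` are illustrative helper names, not library functions.

```python
import math

def discretize(series, num_bins):
    """Map each real value to the index of the equal-width bin it falls in."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / num_bins or 1.0   # fall back to 1.0 for a constant series
    # The maximum value would land in bin `num_bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in series]

def one_hot(indices, num_bins):
    """Turn a list of bin indices into a list of one-hot rows."""
    return [[1 if r == i else 0 for r in range(num_bins)] for i in indices]

T, R = 8, 4                                # timesteps and number of bins
series = [math.sin(2 * math.pi * t / T) for t in range(T)]
bins = discretize(series, R)               # length-T list of class labels
grid = one_hot(bins, R)                    # T rows, R columns, one 1 per row
```

In practice a classifier trained with a cross-entropy loss usually consumes the integer labels in `bins` directly as targets, and the one-hot rows are only needed when the bins are fed back in as inputs.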
hardmaru reimplemented the MDN stuff from one of his earlier blogposts in PyTorch: https://github.com/hardmaru/pytorch_notebooks/blob/master/mixture_density_networks.ipynb
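For readers who want to see what the MDN objective looks like before opening the notebook, here is a minimal pure-Python sketch of the negative log-likelihood of a scalar target under a 1-D Gaussian mixture. It is not Graves's or hardmaru's code; the function name and the parameterization (log weights, means, log standard deviations) are assumptions. Computing the mixture in log space with the log-sum-exp trick is a common way to reduce the kind of numerical instability mentioned earlier in the thread.

```python
import math

def mdn_nll(y, log_pi, mu, log_sigma):
    """Negative log-likelihood of scalar y under a 1-D Gaussian mixture.

    log_pi    -- log mixture weights (assumed already normalized)
    mu        -- component means
    log_sigma -- log standard deviations of the components
    """
    log_terms = []
    for lp, m, ls in zip(log_pi, mu, log_sigma):
        z = (y - m) / math.exp(ls)
        # log( pi_k * N(y | mu_k, sigma_k) ), kept entirely in log space
        log_terms.append(lp - ls - 0.5 * math.log(2 * math.pi) - 0.5 * z * z)
    # log-sum-exp: subtract the largest term before exponentiating
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))

half = math.log(0.5)
nll_near = mdn_nll(0.0, [half, half], [0.0, 0.0], [0.0, 0.0])
nll_far = mdn_nll(1000.0, [half, half], [0.0, 1.0], [0.0, 0.0])
```

With the naive formula, the far-away target would make every component density underflow to zero and the log would fail; in log space the loss stays finite.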
The way you have set it up, you are only using the LSTM to model the change from one data point to the next within a sequence. It doesn’t get any information about the sequences before it. The cell state needs to be passed from sequence to sequence to take full advantage of the LSTM, so that it covers not just what happens from the 1st to the 20th point of a 20-point sequence but all the data points from 1 to 30,000, or however many original data points you have. Hence the data cannot be parallelized or shuffled, as the output from each point is needed before training on the next input and output.
If you are just training on 20 data points to predict one future point, using the LSTM only for the order within steps 1-20 and not carrying the cell state forward, you might as well use a regular MLP, because you are getting no advantage from the LSTM.
I hope that helps
@Kaixhin Thanks for the link!
Sorry, I didn’t clarify enough. The MDN itself is not hard to implement as a concept (well, it depends on which distribution you use for the MDN). Alex Graves uses Gaussian mixture models for his MDN. For me, the GMM required lots of work and was still quite unstable (however, I didn’t debug it further).
Hmm, I am not quite sure I agree (to be fair, I haven’t tried this before). An LSTM is not a magical tool in the end; it has limited capacity. If you propagate the hidden state across the entire sequence (30,000 points), this hidden state isn’t a good enough representation to memorize such super-long dependencies. Theoretically it can model sequences of arbitrary length, but in practice it doesn’t (this of course depends a lot on the data and its dimensionality). We ‘implicitly’ observe that even in image captioning the LSTM can forget that it has already produced some words, and we are talking about short sentences there. To be fair, we didn’t explore this phenomenon in depth. This is meta-analysis.
Also, if you use just 20 points, you are still taking advantage of the LSTM. The cell state is carried from one time step to the next (from one point to another), so this is not equivalent to an MLP at all.
A possible enhancement may be to increase the sequence length.
Yes, but he is using 20 data points to make “one” prediction. An MLP would suffice for this. The data is just one float per point in the time series, so 30,000 points is not a lot of data. Maybe I’m bad at explaining this, so here is a link with a good explanation of using a stateful LSTM:
Thank you for the link, I will check it. I am interested to know more about this concept (I don’t have this case in my work).
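No problem. To do something like this in PyTorch you would just do something like:
output, (hx, cx) = model((input, (hx, cx)))
and then in def forward:
def forward(self, inputs):
    # inputs is (x, (hx, cx)): the current input plus the state from the previous call
    x, (hx, cx) = inputs
    x = x.view(x.size(0), -1)          # flatten to (batch, features)
    hx, cx = self.lstm(x, (hx, cx))    # assumes self.lstm is an nn.LSTMCell
    x = hx
    return x, (hx, cx)                 # hand the updated state back to the caller
Obviously there is a lot of other stuff in there for your desired outputs, but that’s the underlying basics of it.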
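In runnable form, a stateful training loop along these lines might look like the sketch below. The class name, layer sizes, chunk length and the sine-wave data are all assumptions made for illustration. The one detail worth noting is the detach() call, which keeps the state values but cuts the gradient history at each chunk boundary (truncated backpropagation through time).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class StatefulLSTM(nn.Module):
    """Hypothetical one-step model: LSTMCell followed by a linear read-out."""
    def __init__(self, n_in=1, n_hid=8):
        super().__init__()
        self.cell = nn.LSTMCell(n_in, n_hid)
        self.head = nn.Linear(n_hid, 1)

    def forward(self, x, state):
        hx, cx = self.cell(x, state)
        return self.head(hx), (hx, cx)

model = StatefulLSTM()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
series = torch.sin(torch.linspace(0, 20, 201)).unsqueeze(1)  # (time, 1)

# The state is created once and then carried across every chunk, in order.
hx = torch.zeros(1, 8)
cx = torch.zeros(1, 8)
chunk = 20
for start in range(0, 200, chunk):
    # Keep the values, drop the graph of the previous chunk.
    hx, cx = hx.detach(), cx.detach()
    loss = 0.0
    for t in range(start, start + chunk):
        pred, (hx, cx) = model(series[t].unsqueeze(0), (hx, cx))
        loss = loss + (pred - series[t + 1].unsqueeze(0)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the state flows from one chunk into the next, the chunks have to be visited in temporal order and cannot be shuffled, which is the trade-off discussed above.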
Ok now I’m confused
Let me backtrack a bit to make sure I understand what you’re saying. You’re right that I’m reshaping my time series into a bunch of sequences of length q and using each one to predict the “next” observation. My input (and ground-truth) data is organized sequentially (timesteps 1-20 to predict 21, 2-21 to predict 22, etc.). I then run an nn.LSTM model with input size (20, batch_size, 1) to predict that “next” value. Now, you’re saying this is memoryless, and you also mentioned using nn.LSTMCell to create a stateful LSTM in an earlier reply. Are these two things related? This is where my confusion lies, because I can’t see how my setup is preventing the cell state from being carried from one time step to the next (I guess I just assumed nn.LSTM would automagically do this because… it’s an LSTM).
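For reference, the windowing described above (steps 1-20 predict 21, steps 2-21 predict 22, and so on) can be sketched in plain Python. `make_windows` is a hypothetical helper, and the integer series stands in for real data.

```python
def make_windows(series, q):
    """Return (inputs, targets): each input holds q consecutive values and
    its target is the value that immediately follows the window."""
    n = len(series) - q
    inputs = [series[i:i + q] for i in range(n)]
    targets = [series[i + q] for i in range(n)]
    return inputs, targets

series = list(range(1, 31))        # stand-in for timesteps 1..30
X, y = make_windows(series, q=20)  # 10 overlapping windows
```

Each window is a separate sequence as far as nn.LSTM is concerned: when no initial state is passed, the hidden and cell states default to zeros, so nothing computed during one window's forward pass carries into the next unless you pass the state in yourself.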
It’s not completely memoryless, but it is only useful for what happens within, say, steps 3-22: it has no recollection of what happened at steps 1 and 2 to help predict 23, and can only draw on information from 3-22. Have you seen this repo?:
It should help you
I say LSTMCell because it’s more intuitive: it is unrolled, so the output is ready for the next input. With a rolled-up LSTM you would instead pass the final cell state of one sequence on to the next sequence, since that final state is the only part you need to carry over.
And yes, you would think it automatically works this way, but for lots of other sequence tasks this is not advantageous, e.g. when translating a sentence, the order of words in the previous sentence should have no bearing on how you translate the current sentence.
I did see the example a few weeks ago, but it’s more useful now that I’ve been learning a bit more. If I replicate this method with the time series I posted in the OP I get the same results:
The prediction in the sine wave example is done like this:
for i, input_t in enumerate(input.chunk(input.size(1), dim=1)):
    h_t, c_t = self.lstm1(input_t, (h_t, c_t))
    h_t2, c_t2 = self.lstm2(c_t, (h_t2, c_t2))
    outputs += [c_t2]
for i in range(future):  # if we should predict the future
    h_t, c_t = self.lstm1(c_t2, (h_t, c_t))
    h_t2, c_t2 = self.lstm2(c_t, (h_t2, c_t2))
    outputs += [c_t2]
I can see the final state c_t2 in the first loop is used as input for the first predicted pair h_t, c_t in the second loop. The overall logic seems to be: take your last cell state and use it as input along with the second-to-last cell/hidden states, calculate new cell/hidden states, use these as inputs, repeat. Why use h_t, c_t in the first cell instead of h_t2, c_t2 if these are the last cell/hidden states? I’m also confused about why there are two time steps involved. I’d appreciate it if someone could break down the logic a bit more.
OK, so in that example you generate 100 sine waves, each consisting of 1000 points.
To train, he takes 97 of these sine waves and uses the first 999 points of each to predict the following values, training on the loss between output and target. (The key here is that he predicts 999 outputs, t2-t1000, not just the t1000 output.)
At the end of each epoch he tests on the 3 sine waves left over, again predicting 999 points, but he then also feeds the last output c_t2 into the future loop to make the next prediction. Because the first loop has already produced (h_t, c_t) and (h_t2, c_t2), he has everything he needs to propagate to the next step, and he does so for the next 1000 steps.
“Why use h_t, c_t in the first cell instead of h_t2, c_t2 if these are the last cell/hidden states?”
He is using those for the future predictions. Remember, in the final test he first runs the 999-step loop, so those values are still there to use, and then he makes 1000 future predictions.
h_t, c_t for lstm1
h_t2, c_t2 for lstm2
Also, I think what can get confusing here is that this all happens during the forward pass, and the whole network is built around manipulating this cell state in temporal order, becoming more accurate through training with backpropagation through time.