LSTM time-series prediction

Hi,
So, I had a lot of trouble with this issue a few months ago for my PhD.
To be exact, what I had is lots of x, y points of pen trajectories (continuous data) drawing letters. I have the whole alphabet recorded for 400 writers. What I wanted was to let the LSTM (GRU in my case) generate letters. Straightforward? Not at all!

  1. I trained the model on prediction (predicting the next step), and then used it to generate the letters. I was inspired by the approach Andrej Karpathy describes in his blog
    http://karpathy.github.io/2015/05/21/rnn-effectiveness/
    The prediction result was perfect! However, when I generate (note again, this is continuous data; Karpathy uses categorical data - characters -), it always flattens out (literally saturating at some value very quickly). Whatever the letter, whatever the writer style, the result is always shit! (There is a rough sketch of this setup right after this list.)
  2. I then tried something very simple: just learn (or, technically, memorize) a continuous sine wave. That took 2 weeks to resolve (just to make the GRU memorize it). Refer to the discussion in this thread
    LSTM time sequence generation
    and this question on Stack Overflow
    https://stackoverflow.com/questions/43459013/lstm-time-sequence-generation-using-pytorch
    The whole trick was to ‘increase’ the sequence length enough during training, to let the algorithm capture the ‘whole’ pattern. Remember, this is just to memorize, not to ‘learn’.
    My conclusion was: discretize, discretize, discretize.
  3. I found this super awesome paper from Alex Graves
    https://arxiv.org/abs/1308.0850
    where he solves this problem (but on a different dataset). He uses something called a Mixture Density Network. It is really a beautiful thing! He manages to handle continuous data neatly.
    I tried to replicate his architecture in pytorch (well, to be fair, I didn’t try hard enough), but it was very unstable during training. I tried to stabilize it in many ways, but I had to stop pushing in this direction shortly after (you know, supervisors are moody!)
  4. Following the advice in the thread mentioned in point 2 (to make the model see its own output), I tried it (still believing that continuous data would work…). This is similar to what @Kaixhin suggested. Although the idea makes sense, how to actually do it is a big, big issue!
    In the continuous domain, nothing happens (literally nothing). The results are still shit.
    To get some idea of how to do this, take a look at this paper - called Scheduled Sampling - (but the paper implements it for discrete data - of course! -); the second sketch after this list shows the basic idea.
    https://arxiv.org/abs/1506.03099
    Then this awesome guy came along, proving that this scheme of feeding the model its own output can lead to problems! (and if the model at any point in time sees just its own output, it will never learn the correct distribution!!)
    https://arxiv.org/abs/1511.05101
    So the guys who did the Scheduled Sampling paper wrote a new paper, acknowledging this awesome guy’s neat paper, in order to remedy the problem. It is called ‘Professor Forcing’
    https://arxiv.org/abs/1610.09038
    To be honest, I don’t like this paper very much, but I think it is a good step in the right direction (note again, they only use categorical data - finite letters or words -).
    In short, how to make the model take its own output error into account is still an open issue (to the best of my knowledge).
  5. Having reached a huge amount of failure and frustration at this point, I decided to discretize the data. Instead of using x, y directly, I used another encoding (Freeman codes; no need to go into details, but there is a small sketch of the idea after this list).
    I used the architecture mentioned by Karpathy in his blog + the ‘Show and Tell’ technique to bias the model (since I have the same letters - labels - from different writers)
    https://arxiv.org/abs/1411.4555
    It worked neatly!! I am super happy with it (even though you lose important info when using Freeman codes, but we have fkg letters! and many are really complex, and beautiful).
  6. I was feeling confident at this point, so I decided to add ‘distance’ to be predicted along with the Freeman code - I discretized the shit out of it - (forget about the details; just imagine I want to predict two different random variables instead of one). I don’t fuse the modalities; I have two softmaxes at the end, each predicting a different random variable.
    That is super tricky! In prediction, it works perfectly. In generation, it is complete shit!
    After some thinking, I came to realize the problem. Assume the LSTM output is h, the Freeman code variable is R1 and the distance variable is R2. When I train, I train to model P(R1 | h) and P(R2 | h), but not P(R1, R2 | h). When you sample like Karpathy does, you use a multinomial distribution for each softmax. This leads the model to follow two different paths for R1 and R2, which leads to this mess (in short, with this scheme, you need P(R1, R2 | h); the last sketch after this list illustrates the issue). I am still working on this problem at the moment.
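For reference, a minimal sketch of the setup in points 1 and 2 (next-step prediction on continuous x, y points with a GRU, then autoregressive generation). The model size, names and training loop are illustrative assumptions, not my exact code:

```python
import torch
import torch.nn as nn

class PenRNN(nn.Module):
    # GRU that predicts the next (x, y) point from the previous ones.
    def __init__(self, hidden_size=256):
        super().__init__()
        self.gru = nn.GRU(input_size=2, hidden_size=hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 2)

    def forward(self, xy, h=None):           # xy: (batch, seq_len, 2)
        out, h = self.gru(xy, h)
        return self.out(out), h

model = PenRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training: teacher forcing on the whole trajectory at once
# (point 2: short chunks were not enough to capture the full pattern).
def train_step(traj):                        # traj: (batch, seq_len, 2)
    pred, _ = model(traj[:, :-1])            # predict point t+1 from point t
    loss = nn.functional.mse_loss(pred, traj[:, 1:])
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Generation: feed the model its own output. With plain MSE on continuous
# data, this is exactly where the trajectory flattens out to a constant value.
@torch.no_grad()
def generate(seed_xy, steps=200):            # seed_xy: tensor of shape (2,)
    pts, h, x = [seed_xy], None, seed_xy.view(1, 1, 2)
    for _ in range(steps):
        y, h = model(x, h)
        pts.append(y.view(2))
        x = y                                # own output becomes the next input
    return torch.stack(pts)
```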
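The Scheduled Sampling idea from point 4, stepping the RNN manually so each time step is either teacher-forced or fed the model's own prediction. Names, sizes and the probability schedule are my assumptions, and in my continuous setup this did not fix anything:

```python
import random
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=2, hidden_size=256)   # step-by-step GRU
head = nn.Linear(256, 2)
opt = torch.optim.Adam(list(cell.parameters()) + list(head.parameters()), lr=1e-3)

def train_step(traj, sampling_prob):
    # traj: (batch, seq_len, 2); sampling_prob is increased from 0 towards 1
    # over training -- that increase is the "schedule" in Scheduled Sampling.
    h = torch.zeros(traj.size(0), 256)
    inp, loss = traj[:, 0], 0.0
    for t in range(1, traj.size(1)):
        h = cell(inp, h)
        pred = head(h)
        loss = loss + nn.functional.mse_loss(pred, traj[:, t])
        # Coin flip: the next input is either the model's own (detached)
        # prediction or the ground-truth point (teacher forcing).
        inp = pred.detach() if random.random() < sampling_prob else traj[:, t]
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```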
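The Freeman-code discretization from point 5 is nothing fancy: each pen movement is quantized to the nearest of 8 compass directions, so the continuous trajectory becomes a sequence of symbols that the char-RNN recipe can digest. Roughly like this (8 directions assumed; not my exact preprocessing):

```python
import numpy as np

def freeman_codes(xy):
    """Quantize a pen trajectory of shape (N, 2) into 8-direction Freeman chain codes."""
    deltas = np.diff(np.asarray(xy, dtype=float), axis=0)
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])          # angle of each step
    return np.round(angles / (np.pi / 4)).astype(int) % 8    # symbols in {0, ..., 7}

# A short stroke moving right, then up-right, then up:
print(freeman_codes([(0, 0), (1, 0), (2, 1), (2, 2)]))       # -> [0 1 2]
```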
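And a tiny illustration of the point-6 problem (everything here is made up for illustration): two softmax heads on the same hidden state are trained as P(R1 | h) and P(R2 | h), and sampling each head with its own multinomial ignores any dependency between the two variables:

```python
import torch
import torch.nn as nn

hidden, n_codes, n_dist_bins = 256, 8, 20     # Freeman-code head + discretized-distance head

head_r1 = nn.Linear(hidden, n_codes)          # trained as P(R1 | h)
head_r2 = nn.Linear(hidden, n_dist_bins)      # trained as P(R2 | h)

h = torch.randn(1, hidden)                    # stand-in for the GRU output at one step

# Karpathy-style sampling, one multinomial per softmax:
r1 = torch.multinomial(torch.softmax(head_r1(h), dim=-1), 1)
r2 = torch.multinomial(torch.softmax(head_r2(h), dim=-1), 1)
# r1 and r2 are drawn independently given h, so the pair (r1, r2) can easily be a
# combination that never occurs in the data: the model was trained on the two
# marginals, never on the joint P(R1, R2 | h).
```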

In short, my advice is (if you need something quick):

  1. Discretize (categorize) your data
  2. Use Karpathy's approach (mentioned in his blog)
  3. If you need to bias the model, there are many techniques to do that. I tried Show and Tell. It is simple, beautiful and works well.
  4. If you want to predict more than one variable, it is still an issue (let me know if you have some ideas)

If you have time, I suggest you look at Alex Graves' approach (I would love to see an implementation of this MDN in pytorch).
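As a starting point, here is a minimal sketch of what an MDN output head could look like in pytorch: the RNN state is mapped to mixture weights, means and standard deviations of a K-component Gaussian over the next (x, y) step, and the loss is the negative log-likelihood under that mixture. I use diagonal covariances for brevity; Graves' paper uses full bivariate Gaussians (with a correlation term) plus an end-of-stroke Bernoulli, and all sizes and names below are assumptions:

```python
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Maps an RNN hidden state to a K-component Gaussian mixture over (x, y)."""
    def __init__(self, hidden_size=256, k=20):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(hidden_size, k * 5)          # pi, mu_x, mu_y, sigma_x, sigma_y

    def forward(self, h):                                # h: (batch, hidden_size)
        p = self.fc(h).view(-1, self.k, 5)
        log_pi = torch.log_softmax(p[..., 0], dim=-1)    # mixture weights (log-space)
        mu = p[..., 1:3]                                 # component means
        sigma = torch.exp(p[..., 3:5]).clamp(min=1e-4)   # positive std devs
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, target):
    """Negative log-likelihood of target (batch, 2) under the mixture."""
    target = target.unsqueeze(1)                          # (batch, 1, 2) -> broadcast over K
    log_prob = torch.distributions.Normal(mu, sigma).log_prob(target).sum(-1)  # (batch, K)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```

During training this replaces the MSE loss on the raw x, y values; during generation you pick a mixture component (multinomial over the weights) and sample the next point from it, which lets the output stay multimodal instead of collapsing towards a mean value.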

Good luck!
