I implemented a layer-normalized LSTMCell from scratch. Everything works fine but it is much slower than the original LSTM. I noticed that the original LSTMCell is based on the LSTMFused_updateOutput which is implemented with C code. I am wandering if there is some easy way to speed up the LayerNorm LSTM without modifying the C implementation in the backend? Thank you very much!
Sorry to bring up a dead old thread, but I’ve recently reimplemented a LayerNorm LSTM (using the code above). Even with the suggestion to use the fused backend, I’m getting some pretty horrible speeds - about half the speed of the native LSTMCell implementation.
I suspect there aren’t any other ways to get a speedup, and that most of the difference is due to the native implementation being able to call the CuDNN optimized LSTM implementation directly. Is there a way to get Layer Norm into the CuDNN LSTM implemenation?
@SimonW I am sorry, I was not aware that pytorch had LSTM/GRU with layer norm built into it. I could not find it. Can you please point me to it? Thanks a lot.
As of 1.0 the fused pointwise backend is no longer importable. This is causing some pretty bad regressions in my model performance - is it possible to fix this? Even rewriting the LayerNormLSTM in Torch Script is a bit slower than it was before.
If you’re looking for a fast layer norm LSTM written in CUDA, you can try Haste (https://github.com/lmnt-com/haste). I’d love to see how well TorchScript’s performance compares – it would be really nice to have a flexible high-level approach that matches straight-up CUDA code.