It might interest you to know that I’ve been trying to do something similar myself: Confusion regarding PyTorch LSTMs compared to Keras stateful LSTM
That said, I'm not sure whether just wrapping the previous hidden state's data in a torch.Variable is enough to make stateful training work.
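
For what it's worth, here is a minimal sketch of the pattern as I understand it, not a confirmed answer. It assumes PyTorch 0.4 or later, where `Variable` was merged into `Tensor` and calling `.detach()` on the state does the job that re-wrapping `hidden.data` in a `Variable` used to do. The sizes, the random data, the linear `head`, and the `detach_state` helper are all made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch_size, seq_len, n_features, hidden_size = 4, 5, 3, 8

lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, 1)  # hypothetical output layer
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.MSELoss()


def detach_state(state):
    """Keep the values of (h, c) but cut them off from the previous graph."""
    return tuple(s.detach() for s in state)


state = None  # None makes nn.LSTM start from zeros on the first chunk
losses = []
for step in range(3):
    # Random stand-in data; each chunk is assumed to continue the previous one.
    x = torch.randn(batch_size, seq_len, n_features)
    y = torch.randn(batch_size, 1)

    # Carry the state values forward, but do not backprop into earlier chunks.
    if state is not None:
        state = detach_state(state)

    out, state = lstm(x, state)
    loss = loss_fn(head(out[:, -1, :]), y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The state values carry over from one chunk to the next, which is the part that resembles Keras's `stateful=True`, while detaching keeps each `backward()` call confined to the current chunk. Whether that matches Keras's behaviour in every respect is exactly the thing I'm unsure about.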