Best practices for batching data for stateful LSTM and text generation

Yes, this organization of batches is the correct way to go if you want to "preserve" the hidden state between batches. You will, however, want to call .detach() on the hidden state between batches; otherwise the computational graph keeps growing across batches, and each backward pass becomes increasingly slow and memory-hungry.
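A minimal sketch of what that looks like in a training loop (dimensions and the random data are placeholders for illustration; the key part is the detach at the end of each step):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, just for illustration.
vocab_size, embed_dim, hidden_dim, batch_size, seq_len = 100, 32, 64, 4, 10

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)
optimizer = torch.optim.Adam(
    list(embedding.parameters()) + list(lstm.parameters()) + list(head.parameters())
)
criterion = nn.CrossEntropyLoss()

hidden = None  # nn.LSTM initializes (h, c) to zeros when hidden is None
for step in range(3):  # stand-in for iterating over your contiguous batches
    # Random stand-in data; in practice these come from your DataLoader.
    inputs = torch.randint(0, vocab_size, (batch_size, seq_len))
    targets = torch.randint(0, vocab_size, (batch_size, seq_len))

    output, hidden = lstm(embedding(inputs), hidden)
    loss = criterion(head(output).reshape(-1, vocab_size), targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Detach so the next backward() does not try to backpropagate through
    # previous batches -- this is what keeps the graph from growing.
    hidden = tuple(h.detach() for h in hidden)
```

Without the final detach, the second call to loss.backward() would fail (or require retain_graph=True and ever-growing memory), because the hidden state would still be connected to the previous batch's graph.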

You can implement this using a Dataset and a custom Sampler. A basic version of this approach creates batches in which all sequences are guaranteed to have the same length; the relevant logic lives in the Sampler class, which groups dataset indices by sequence length before batching.
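A minimal sketch of such a setup (the class names BucketBatchSampler and SequenceDataset are my own; the idea is to bucket indices by sequence length, then cut each bucket into batches):

```python
import random
from collections import defaultdict

import torch
from torch.utils.data import DataLoader, Dataset, Sampler


class SequenceDataset(Dataset):
    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx])


class BucketBatchSampler(Sampler):
    """Yields lists of indices whose sequences all share the same length."""

    def __init__(self, sequences, batch_size):
        # Group dataset indices by sequence length.
        buckets = defaultdict(list)
        for idx, seq in enumerate(sequences):
            buckets[len(seq)].append(idx)
        # Cut each bucket into batches of at most batch_size indices.
        self.batches = []
        for idxs in buckets.values():
            random.shuffle(idxs)
            for i in range(0, len(idxs), batch_size):
                self.batches.append(idxs[i:i + batch_size])
        random.shuffle(self.batches)

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)


# Toy data: three sequences of length 2 and two of length 3.
sequences = [[1, 2], [3, 4], [5, 6, 7], [8, 9, 10], [11, 12]]
dataset = SequenceDataset(sequences)
loader = DataLoader(dataset, batch_sampler=BucketBatchSampler(sequences, batch_size=2))

for batch in loader:
    print(batch.shape)  # every batch stacks sequences of a single length
```

Because every batch contains sequences of one length, the default collate function can stack them into a single tensor without any padding. Note that DataLoader takes the sampler via the batch_sampler argument (not sampler), since it yields whole batches of indices at a time.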