I just started learning NLP and RNNs. I read O'Reilly's Deep Learning: Natural Language Processing, and here is the figure from the book showing batch training in an RNN:
(please ignore the Chinese characters; they are not important)
In the above figure, the author uses Truncated BPTT as an example to show how to train on a 1000-word time series, truncated every 10 words, with a batch size of 2.
As the figure shows, the first sequence in the first batch starts from X0, and the second sequence in the first batch starts from X500, i.e., shifted from X0 by 500. The later batches follow the same pattern.
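To make sure I understand the slicing, here is a minimal sketch of how I think the figure forms the batches (my own illustration, not the book's code; the names corpus, jump, and offsets are mine):

```python
import numpy as np

corpus = np.arange(1000)          # stand-in for word IDs x_0 ... x_999
batch_size = 2
time_steps = 10                   # truncation length for BPTT

jump = len(corpus) // batch_size  # 500: offset between the rows of a batch
offsets = [i * jump for i in range(batch_size)]  # [0, 500]

# Walk through the corpus in truncated chunks; each batch holds
# `batch_size` parallel sequences of `time_steps` words.
for step in range(0, jump, time_steps):
    batch = np.stack([corpus[off + step : off + step + time_steps]
                      for off in offsets])
    # First iteration: row 0 is x_0..x_9, row 1 is x_500..x_509.
    # As I understand it, the hidden state is carried over between
    # iterations of this loop, but gradients are only backpropagated
    # within each 10-step chunk.
    print(batch)
```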
Here comes my question: does this mean that the sequences in the first batch have no hidden state to inherit? For example, the sequence starting at X500 never receives h499. What is more, does that mean forward propagation is also truncated at X500, which would imply that the longest memory in this setup can only span 500 words?
