Effect of padding on RNN hidden state

This is more of a theoretical question. I am wondering about the consequences of processing a padded batch with an RNN. I know that no weights will be updated from the zeroed inputs, since their gradient will be zero. However, as far as I understand, the hidden state will still be updated, because the update involves the previous hidden state, which is not zeroed. This means that when you start processing the next batch, the input hidden state will not be the one you expect (i.e. the one that was output right after the end of the previous sentence).
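To illustrate the point, here is a minimal sketch (the layer sizes are arbitrary) showing that feeding all-zero "padding" steps still moves the hidden state, because the recurrent weights and biases act on the previous hidden state:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy single-layer GRU; sizes chosen only for illustration.
rnn = nn.GRU(input_size=4, hidden_size=3, batch_first=True)

h0 = torch.randn(1, 1, 3)            # some non-zero hidden state
pad = torch.zeros(1, 5, 4)           # five all-zero "padding" timesteps

out, h5 = rnn(pad, h0)
# h5 differs from h0 even though every input was zero: each step still
# mixes in the previous hidden state and the bias terms.
print(torch.allclose(h5, h0))        # False
```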

Please note that I am aware of the support for variable-length sequences via pack_padded_sequence. However, I have in mind cases where sorting sentences by length would actually disrupt the training data, because the original sentence order must be respected (e.g. language modelling), and I need to process the dataset sentence by sentence (unlike the word_language_model.py example, where the dataset is processed as a single vector).
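As an aside, recent PyTorch versions let pack_padded_sequence accept unsorted batches via enforce_sorted=False, so the original sentence order can be kept. A hedged sketch (layer sizes and lengths are made up for illustration):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

rnn = nn.GRU(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(2, 6, 4)             # batch of 2, padded to length 6
lengths = torch.tensor([3, 6])       # unsorted true lengths are fine

packed = pack_padded_sequence(x, lengths, batch_first=True,
                              enforce_sorted=False)
out_packed, h = rnn(packed)
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)
# h[-1] holds, for each sequence, the hidden state after its last real
# token; the padded positions of `out` are filled with zeros.
```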

Of course, you could use the {RNN,LSTM,GRU}Cell implementations and pick the hidden state computed at the last step before the padding, but then you incur the speed penalty of stepping through the sequence cell by cell.
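A possible middle ground, sketched below under assumed shapes: run the full padded batch through the fast batched nn.LSTM, then gather each sequence's output at its last real timestep. Since the RNN is causal, the output at step length-1 is unaffected by the padding that follows:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(2, 6, 4)             # batch of 2, padded to length 6
lengths = torch.tensor([6, 3])       # true lengths before padding

out, _ = rnn(x)                      # out: (batch, seq, hidden)

# Index the output at each sequence's last non-padded step.
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out.size(2))
last = out.gather(1, idx).squeeze(1) # (batch, hidden)
```

This keeps the batched kernel's speed and still recovers the hidden state you would have picked manually with the Cell implementations.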
