Sequence packing in hierarchical RNN model


I am implementing a hierarchical GRU model for document classification, described as follows:

I first run a word-level model (GRU). The input to the word-level network is the kth sentence, tokenized and padded, of every article in the batch. I compute the output for every k and concatenate the outputs together (to be fed into a sentence-level GRU).

and I have encountered the following problem:

For example, suppose the batch size is 3 and I am looking at the 6th sentence (k = 6). If the third article has only 5 sentences, the input, after sorting by length, looks something like:

```
[1, 56, 67, 8, 90]   # 6th sentence in article 1
[7, 123, 6, 9, 0]    # 6th sentence in article 2
[0, 0, 0, 0, 0]      # 6th sentence in article 3
```

I cannot pack the third sentence, since its length is 0 and `pack_padded_sequence` requires all lengths to be positive.
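To make the failure concrete, here is a minimal sketch with toy data (the shapes and feature size are made up for illustration):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Toy batch: 3 sequences, padded to length 5, with 2 features each.
x = torch.zeros(3, 5, 2)
lengths = torch.tensor([5, 4, 0])  # third sequence is empty

try:
    pack_padded_sequence(x, lengths, batch_first=True)
except RuntimeError as err:
    print("packing failed:", err)
```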

The only way I can think of is to remove the zero-length sequences before feeding the batch to the GRU, and then insert zero vectors back into the GRU's output at the corresponding positions (a bit like how we sort and unsort the sequences by length).
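Something like this sketch of the idea (the sizes, the toy input, and the use of the final hidden state as the sentence vector are all my own assumptions, not a tested implementation):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

batch, seq_len, hidden = 3, 5, 4
gru = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)

# Toy embedded input: the k-th sentence of each article, sorted by length.
x = torch.randn(batch, seq_len, 2)
lengths = torch.tensor([5, 4, 0])  # third article has no k-th sentence

# Drop the zero-length rows before packing.
nonzero = lengths > 0
packed = pack_padded_sequence(x[nonzero], lengths[nonzero],
                              batch_first=True, enforce_sorted=True)
_, h_n = gru(packed)  # h_n: (1, num_nonzero, hidden)

# Scatter back: zero vectors where the article had no k-th sentence.
out = torch.zeros(batch, hidden)
out[nonzero] = h_n.squeeze(0)
print(out.shape)  # torch.Size([3, 4])
```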

I'm not sure whether this is efficient, or whether it works with autograd.
Is there a better way to get around this problem?