Has anyone tried to implement both batch learning and truncated BPTT with variable-length sequences?

Actually I’m using Packed Sequences but I’m not completely satisfied. I’ll show you my method and the issues with this approach:

Take `k = k1 = k2`, the k of truncated BPTT, and `m = maximum sequence length`. If necessary, I pad the sequences so that `m % k == 0`.
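The pad-to-a-multiple-of-k step can be done like this, for example (a sketch assuming NumPy; the lengths are toy values):

```
import numpy as np

k = 3                                        # truncation length (hypothetical)
seqs = [list(range(n)) for n in (7, 4, 2)]   # toy variable-length sequences
max_len = max(len(s) for s in seqs)
m = int(np.ceil(max_len / k)) * k            # round m up to a multiple of k
padded = np.zeros((len(seqs), m))            # zero-pad every sequence to length m
for row, s in zip(padded, seqs):
    row[:len(s)] = s
print(m)  # 9
```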

Take `trunc_iter = m / k`, the number of iterations of truncated BPTT (with the new `m`!). Then I compute the sequences' lengths for each of the `trunc_iter` iterations. It's a bit tricky; I think the code will explain this better:

```
# seq_lengths: original length of each sequence (modified in place below)
lengths = np.zeros((nb_sequences, m // k), dtype=int)  # np.int is deprecated
for i in range(lengths.shape[1]):
    lengths[:, i] = np.minimum(k, seq_lengths)  # at most k steps in this chunk
    seq_lengths -= k
    lengths[lengths[:, i] < 0, i] = 0  # finished sequences get length 0
```

`seq_lengths` is a variable in the outer scope which stores the original sequences' lengths. At the end of this operation I will have computed the sequences' lengths for each iteration of the truncated BPTT.
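To make the length computation concrete, here is a small self-contained run with toy values (`k = 3`, `m = 9`, three sequences of lengths 9, 5, and 2):

```
import numpy as np

k = 3
m = 9                                   # already a multiple of k
seq_lengths = np.array([9, 5, 2])       # toy original lengths
nb_sequences = len(seq_lengths)

lengths = np.zeros((nb_sequences, m // k), dtype=int)
for i in range(lengths.shape[1]):
    lengths[:, i] = np.minimum(k, seq_lengths)
    seq_lengths -= k
    lengths[lengths[:, i] < 0, i] = 0   # sequence already exhausted

print(lengths)
# [[3 3 3]
#  [3 2 0]
#  [2 0 0]]
```

The length-2 sequence has a 0-length entry from the second chunk onwards, which is exactly what triggers the PackedSequence issue described below.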

At this point I have all I need to pack the sequences and feed them to the network. I start with the first batch and run all the `trunc_iter` steps of the backpropagation. Then I feed the second batch, and so on and so forth.
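For reference, a minimal sketch of how one per-chunk loop could look in PyTorch; the GRU, the sizes, and the per-chunk `lengths` matrix are hypothetical placeholders, and the loss/optimizer step is elided:

```
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical setup: 3 padded sequences of length m = 9, feature size 4.
rnn = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
batch = torch.randn(3, 9, 4)
lengths = torch.tensor([[3, 3, 3],
                        [3, 2, 0],
                        [2, 0, 0]])           # per-chunk lengths, as computed above
k = 3
hidden = torch.zeros(1, batch.size(0), 8)     # (num_layers, batch, hidden_size)

for i in range(lengths.shape[1]):
    chunk_lens = lengths[:, i]
    mask = chunk_lens > 0                     # drop sequences that are finished
    if not mask.any():
        break
    chunk = batch[mask, i * k:(i + 1) * k]
    packed = pack_padded_sequence(chunk, chunk_lens[mask],
                                  batch_first=True, enforce_sorted=False)
    out, h_out = rnn(packed, hidden[:, mask].contiguous())
    # ... compute the loss on `out`, call backward(), step the optimizer ...
    hidden[:, mask] = h_out.detach()          # detach truncates BPTT at the chunk boundary
```

Masking out finished sequences before packing is one way to sidestep the 0-length restriction, at the cost of a shrinking effective batch.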

There are two issues:

- PackedSequence does not support 0-length sequences. So, if I have a sequence that is shorter than `m - k`, I cannot use it, or I have to change `k`. This is a major issue, especially because I pad the sequences at the first step, so I can be very limited in the choice of `k`.
- I usually put a linear layer after the RNN. Do the zeros in the padded sequences affect the gradient and the weight updates? If so, how can I avoid that?

Also, if you have any implementation to share, it would be very much appreciated. Thanks.