According to my understanding of GRUs, extending a sequence with zeros (i.e., sequence padding) should not make a huge difference in the final output, as long as the padding is not too long.
I had a few problems with my latest network until I figured out that the length differences between the sequences were too big.
My first idea was to sort the sequences by their length, so that the total length difference within a padded batch would be as small as possible. Now I’m wondering whether there is an even better way to prevent the sequence padding from falsifying the network’s output (by too much).
Thank you.
It is not necessary to use outputs produced at padding positions.
If you’re using the hidden state as the RNN result, packed sequences support variable-length sequences (a short sketch follows below).
If you’re using the regular output (with the time dimension), you can create a loss mask from the sequence lengths, so that the padding locations produce zero gradients.
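To illustrate the first option, here is a minimal sketch of feeding packed sequences to a GRU, so that the returned hidden state corresponds to the last real time step of each sequence rather than to padding. All shapes and names here are made up for the example:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

# Batch of 3 sequences with true lengths 5, 3, and 2, padded to length 5.
lengths = torch.tensor([5, 3, 2])
padded = torch.randn(3, 5, 8)  # (batch, max_len, features)

packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, h_n = gru(packed)

# h_n has shape (num_layers, batch, hidden) and already ignores the padding:
# for each sequence it is the hidden state after its last valid step.
print(h_n.shape)  # torch.Size([1, 3, 16])

# If you also need the per-step outputs, unpack them again; positions past
# each sequence's length are filled with zeros.
out, _ = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)  # torch.Size([3, 5, 16])
```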
The BucketIterator creates batches with sequences of the same or similar lengths. Alternatively, you can write your own Sampler that builds batches of equal-length sequences; see this post as an example.
The comments from @googlebot are of course valid, but it’s generally a good idea to avoid batches where the lengths of the sequences vary a great deal.
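As a rough sketch of that idea, here is a custom batch sampler that sorts the dataset indices by sequence length and yields them in chunks, so each batch only contains sequences of similar length. `seq_lengths` is a placeholder for your own per-sample lengths:

```python
import random
from torch.utils.data import Sampler

class ByLengthBatchSampler(Sampler):
    def __init__(self, seq_lengths, batch_size, shuffle=True):
        self.batch_size = batch_size
        self.shuffle = shuffle
        # Indices sorted by the length of the corresponding sequence.
        self.sorted_indices = sorted(range(len(seq_lengths)),
                                     key=lambda i: seq_lengths[i])

    def __iter__(self):
        batches = [self.sorted_indices[i:i + self.batch_size]
                   for i in range(0, len(self.sorted_indices), self.batch_size)]
        if self.shuffle:
            random.shuffle(batches)  # shuffle batch order, not their contents
        for batch in batches:
            yield batch

    def __len__(self):
        return (len(self.sorted_indices) + self.batch_size - 1) // self.batch_size

# Usage: DataLoader(dataset, batch_sampler=ByLengthBatchSampler(lengths, 32))
```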
Thank you for your replies!
I guess Chris’ answer is the more general solution.
However, in my case, I think masking should work better. Just to be sure: using the regular output, I should adjust my targets so that they have size (n_batches, padded_len), i.e., repeat the class index sequence_len times and pad the rest. So assuming I use pad value 0 for my targets, I should then use ignore_index=0 in my loss?
I’m just asking because I’m a bit curious about the targets.
I haven’t tried ignore_index, but that method will probably work. A more general way is to use reduction='none', multiply the padded loss cells by zero, and reduce the loss tensor using a weighted mean (i.e., divide the loss sum by the number of valid cells).
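A small sketch of that masking approach, assuming one classification target per time step (all names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

batch, max_len, n_classes = 3, 5, 10
logits = torch.randn(batch, max_len, n_classes)
targets = torch.randint(0, n_classes, (batch, max_len))
lengths = torch.tensor([5, 3, 2])

# Mask that is 1 at real time steps and 0 at padded ones.
mask = (torch.arange(max_len)[None, :] < lengths[:, None]).float()

# Per-cell loss: cross_entropy expects (N, C), so flatten batch and time.
loss = F.cross_entropy(logits.view(-1, n_classes), targets.view(-1),
                       reduction='none').view(batch, max_len)

# Zero out padded cells, then take a weighted mean over the valid ones.
loss = (loss * mask).sum() / mask.sum()
```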
I see your point. But according to the docs, these two methods should not make any difference, should they?
ignore_index (int, optional) – Specifies a target value that is ignored and does not contribute to the input gradient. When size_average is True, the loss is averaged over non-ignored targets.
Setting ignore_index to the padding value of the targets, their losses will be ignored. It also says that with size_average=True, the mean is based on the non-ignored losses.
However, I’m curious how this behaves with reduction, as size_average is going to be deprecated.
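A quick way to check this equivalence is to compare both reductions directly. The sketch below assumes the pad value 0 discussed above is reserved as the target padding index (so class 0 is not a real class); reduction='mean' (the replacement for size_average=True) should then average only over the non-ignored targets, matching the manual weighted mean:

```python
import torch
import torch.nn as nn

PAD = 0
logits = torch.randn(4, 10)
targets = torch.tensor([3, PAD, 7, PAD])  # two real targets, two padded

auto = nn.CrossEntropyLoss(ignore_index=PAD, reduction='mean')(logits, targets)

# With reduction='none', ignored cells produce a loss of 0.
per_cell = nn.CrossEntropyLoss(ignore_index=PAD, reduction='none')(logits, targets)
mask = (targets != PAD).float()
manual = per_cell.sum() / mask.sum()

print(torch.allclose(auto, manual))  # expected: True
```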