Using output of a pack_padded_sequence


(Utsav Garg) #1

Hello,

I am passing the output of pack_padded_sequence to an RNN and want to feed the mean of the RNN outputs over all time steps to a Linear layer. How can I do this so that the padded portions are not included in the mean and the gradients are computed correctly?

I have defined the pack_padded_sequence, RNN and Linear layer as follows:

self.rnn = torch.nn.RNN(input_size=feature_dim, hidden_size=self.hidden, num_layers=self.num_layers, batch_first=True)
...
self.fc = nn.Linear(self.hidden, self.num_classes)
...
packed = pack_padded_sequence(base_out, lengths, batch_first=True)

Thanks.


(Zili Huang) #2

I ran into the same problem. Someone told me to sum the outputs and divide by the length, but I think that is incorrect, because the length should not be a Variable.
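
For reference, here is a minimal sketch of the sum-and-divide idea using a mask. It is not from any post in this thread. The tensor sizes, the seed, and the names mask, masked_sum and mean_out are made up for illustration, and it assumes the sequences are already sorted by decreasing length. The lengths are plain integer counts that never require gradients, so dividing by them only scales the gradients that flow back through the sum.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Illustrative sizes (assumptions, not from the original post).
batch, max_len, feature_dim, hidden = 3, 5, 4, 6
lengths = torch.tensor([5, 3, 2])  # sorted in decreasing order
base_out = torch.randn(batch, max_len, feature_dim)

rnn = nn.RNN(input_size=feature_dim, hidden_size=hidden, batch_first=True)

packed = pack_padded_sequence(base_out, lengths, batch_first=True)
packed_out, _ = rnn(packed)

# Back to a padded tensor of shape (batch, max_len, hidden);
# out_lengths holds the true length of each sequence.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# mask[b, t] is True for real time steps and False for padding.
mask = torch.arange(out.size(1))[None, :] < out_lengths[:, None]

# Zero the padded steps explicitly, sum over time, divide by true length.
masked_sum = (out * mask.unsqueeze(-1)).sum(dim=1)
mean_out = masked_sum / out_lengths.unsqueeze(1).to(out.dtype)  # (batch, hidden)
```

mean_out can then be passed to the Linear layer. pad_packed_sequence already fills padded steps with zeros, so the explicit mask is a safeguard in case a different padding value is used.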


(Nick P) #3

@utsav @HuangZiliAndy did you figure this out? I am facing a similar problem. I have a hierarchical model which is a combination of a convolutional sentence encoder followed by a GRU. The GRU input is padded to the maximum sequence length, and I need to unpack the output and pass it to a linear layer to produce a binary outcome for each item in the sequence (it’s a synchronous many-to-many RNN). However, I don’t want to calculate gradients for the outputs that are padding. Any help appreciated.


(Barry Plunkett) #4

Also want to bump this. Facing the same problem right now.


(Nick P) #5

Yes, I did. If you’re using cross entropy loss, you can use the ignore_index parameter (e.g. -100): when padding your outcome tensor (y), simply pad it with -100 as well, and the loss then skips those positions, so no gradients are calculated for them. We’re actually using binary cross entropy loss, which doesn’t have ignore_index implemented, but we cheated by passing a weight vector with 0 weight for the padded outcomes.
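
Both tricks described above can be sketched roughly as follows. This is an illustration, not the actual model code: the shapes, the seed, and the variable names are invented, and binary_cross_entropy_with_logits is assumed as the binary loss (the post does not say which variant was used).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# --- Cross entropy with ignore_index ---
# Hypothetical shapes: (batch=2, seq_len=4, num_classes=3).
logits = torch.randn(2, 4, 3, requires_grad=True)
# Padded positions in the outcome tensor are filled with -100.
targets = torch.tensor([[0, 2, 1, -100],
                        [1, -100, -100, -100]])

loss_ce = F.cross_entropy(logits.view(-1, 3), targets.view(-1), ignore_index=-100)
loss_ce.backward()

# Gradients of the logits at the padded positions.
pad_grad = logits.grad[targets == -100]

# --- Binary cross entropy with a 0/1 weight mask ---
scores = torch.randn(2, 4, requires_grad=True)
y = torch.tensor([[1., 0., 1., 0.],
                  [1., 0., 0., 0.]])
# 1 for real items, 0 for padding.
weight = torch.tensor([[1., 1., 1., 0.],
                       [1., 0., 0., 0.]])

# Sum the weighted losses and divide by the number of real items,
# so padded positions do not dilute the average.
loss_bce = F.binary_cross_entropy_with_logits(
    scores, y, weight=weight, reduction="sum"
) / weight.sum()
loss_bce.backward()
```

In both cases the gradients at the padded positions (pad_grad, and scores.grad where weight is 0) come out as zero. Note that with the default reduction="mean", the weighted binary loss would be divided by the total number of elements, padding included, which is why the sketch normalises by weight.sum() instead.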


(Barry Plunkett) #6

Thank you so much, Nick! I had come up with a very gross workaround, and I wasn’t 100% sure it worked.

To be clear, if the outcome tensor (y) is padded with -100 in the locations that correspond to the padded sequence locations, then it doesn’t matter at all what the score values are in the corresponding locations in the input tensor?


(Nick P) #7

That’s correct. Provided your padding value matches what you supply to the ignore_index parameter of the loss function, you’re good. This is called masking.