I am wondering what the current best practice is for handling variable-length image sequences with CNNs.
I want to encode my variable-length image sequences with a CNN and then feed the resulting features into an RNN.
To enable batch processing I pad my image sequences with zero-frames. Then I stack the sequence dimension into the batch dimension and feed everything through the CNN, like this:

batch_size, sequence_length, channels, height, width = image_sequence.size()

# CREATE EMBEDDING
# Reshape BxSx... to (B*S)x... so it can be fed into the CNN
image_sequence = image_sequence.view(-1, channels, height, width)
embeds = self.cnn(image_sequence)

# Reshape back from (B*S)x... to BxSx...
embeds = embeds.view(batch_size, sequence_length, -1)
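(For context, here is a self-contained sketch of one possible workaround I have seen: select only the non-padding frames with a length mask before the CNN, then scatter the features back into the padded B×S layout. The tiny `cnn`, the sizes, and the `lengths` tensor are made up for illustration, not part of my actual model.)

```python
import torch
import torch.nn as nn

B, S, C, H, W = 3, 4, 3, 8, 8
lengths = torch.tensor([4, 2, 1])          # true length of each sequence
image_sequence = torch.randn(B, S, C, H, W)

# toy CNN producing a 6-dim feature per frame
cnn = nn.Sequential(nn.Conv2d(C, 6, 3, padding=1),
                    nn.AdaptiveAvgPool2d(1),
                    nn.Flatten())

# boolean mask of real (non-padding) frames, shape (B, S)
mask = torch.arange(S).unsqueeze(0) < lengths.unsqueeze(1)

flat = image_sequence.view(B * S, C, H, W)
real_frames = flat[mask.view(-1)]          # only the 4 + 2 + 1 = 7 real frames
feats = cnn(real_frames)                   # (7, 6), CNN never sees padding

# scatter back into the padded (B, S, feat) layout, zeros where padded
embeds = feats.new_zeros(B * S, feats.size(1))
embeds[mask.view(-1)] = feats
embeds = embeds.view(B, S, -1)             # (3, 4, 6)
```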
Now for RNNs there is http://pytorch.org/docs/master/nn.html#torch.nn.utils.rnn.pack_padded_sequence, but my CNN still has to run on a lot of padding frames. Is there something like pack_padded_sequence for CNNs?
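(For reference, this is how I use pack_padded_sequence on the CNN embeddings so at least the RNN skips the padding; the sizes, the random `embeds` standing in for the CNN output, and the GRU are placeholders for illustration.)

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

batch_size, max_len, feat_dim = 4, 5, 16
lengths = torch.tensor([5, 3, 2, 2])       # true lengths, sorted descending

# stand-in for the (B, S, feat) embeddings coming out of the CNN
embeds = torch.randn(batch_size, max_len, feat_dim)

rnn = nn.GRU(feat_dim, 8, batch_first=True)

# pack so the RNN never processes the zero-padded time steps
packed = pack_padded_sequence(embeds, lengths, batch_first=True)
packed_out, hidden = rnn(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)  # (4, 5, 8)
```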