How to process variable length sequence of images with CNN

Hi -

I have images in sequences of variable length. I want to first process each image with a CNN to get a feature representation, and then process each resulting variable-length sequence of features with an LSTM. I know that I can pad the variable-length sequences of feature vectors with zeros and create a packed sequence with pack_padded_sequence() before sending it to the LSTM. But how can I do the same to my data before sending it to the CNN? I do not want to pad the image sequences with zeros and let the CNN waste computation on all-zero images. Any thoughts?

My current hack is this: I still pad the image sequence. Once I get a PackedSequence object from pack_padded_sequence(), I call the forward function of my CNN model on the data attribute of the PackedSequence object. I then manually construct a new PackedSequence object from the CNN output, effectively replacing the data field of the old PackedSequence object. Finally, I send the new PackedSequence to the downstream LSTM. Is this an okay workaround?



I’m not sure what an image sequence is. Could you post the shape of your current input and also show how you are currently padding it, please?

Sure. Each data sequence has the format [image_0, action_1, image_1, action_2, image_2, action_3, image_3…], and the task is to predict the action sequence [action_1, action_2,…, action_k] given the image sequence [image_0, image_1,…, image_k]. The image sequences in a batch have variable length k+1, so I pad each sequence with zero images until its length is max_seq_len. The batched input thus has shape (B, max_seq_len, C, H, W).

My network uses a CNN model to embed each image into a feature vector (a state), and then uses an LSTM model to predict the action sequence from the state sequence. To avoid embedding the zero images that are just padding, I do the following:

  1. I use pack_padded_sequence(images, image_seq_lens, batch_first=True, enforce_sorted=False) to produce packed_images.
  2. Run the CNN on packed_images.data to get packed_states_data.
  3. Instantiate (a hack that is advised against) packed_states = PackedSequence(packed_states_data, packed_images.batch_sizes, packed_images.sorted_indices, packed_images.unsorted_indices)
  4. Send packed_states to the LSTM to predict packed_actions
  5. Unpack actions with actions, action_seq_len = pad_packed_sequence(packed_actions, batch_first=True, total_length=self.max_seq_len-1)
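
The five steps above can be sketched end to end as follows. This is a minimal sketch, not the poster's actual code: the tiny CNN, the linear action head, and all sizes are hypothetical stand-ins, and for simplicity it predicts one output per image (the original uses total_length=self.max_seq_len-1 because there is one fewer action than images).

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (
    PackedSequence,
    pack_padded_sequence,
    pad_packed_sequence,
)

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
B, T, C, H, W = 3, 4, 3, 8, 8
feat_dim, hidden, action_dim = 16, 32, 5

# Hypothetical stand-in CNN: conv -> global average pool -> flat feature vector.
cnn = nn.Sequential(
    nn.Conv2d(C, feat_dim, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
head = nn.Linear(hidden, action_dim)  # assumed action head

images = torch.randn(B, T, C, H, W)       # zero-padded in practice
image_seq_lens = torch.tensor([2, 4, 3])  # true lengths, kept on the CPU

# Step 1: pack, which drops the padded frames.
packed_images = pack_padded_sequence(
    images, image_seq_lens, batch_first=True, enforce_sorted=False
)

# Step 2: the CNN only sees the sum(lengths) = 9 real frames.
packed_states_data = cnn(packed_images.data)  # shape (9, feat_dim)

# Step 3: rebuild a PackedSequence around the CNN output,
# reusing the bookkeeping tensors of the packed images.
packed_states = PackedSequence(
    packed_states_data,
    packed_images.batch_sizes,
    packed_images.sorted_indices,
    packed_images.unsorted_indices,
)

# Step 4: the LSTM consumes the packed states directly.
packed_out, _ = lstm(packed_states)
packed_actions = PackedSequence(
    head(packed_out.data),
    packed_out.batch_sizes,
    packed_out.sorted_indices,
    packed_out.unsorted_indices,
)

# Step 5: unpack back to a padded (B, T, action_dim) tensor.
actions, action_seq_len = pad_packed_sequence(
    packed_actions, batch_first=True, total_length=T
)
```

Because the CNN treats every frame independently, the order in which the packed frames arrive does not matter, as long as the same batch_sizes and index tensors are reused when the PackedSequence is rebuilt.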

To calculate the loss between the predicted actions and the ground truth actions_gt, I also pack actions_gt into packed_actions_gt. I use assert (packed_actions.sorted_indices == packed_actions_gt.sorted_indices).all() to make sure both are sorted in the same way by the packing. Then I compute the loss on packed_actions.data and packed_actions_gt.data.
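
A loss over two packed sequences might look like the sketch below. It assumes continuous actions and a mean squared error loss, neither of which is stated in the thread; the tensor names mirror the post, and the random tensors are placeholders for the real model output and labels.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

B, T, action_dim = 3, 4, 5          # hypothetical sizes
lens = torch.tensor([2, 4, 3])      # true sequence lengths

# Placeholder for the padded model output; requires_grad mimics a network output.
actions = torch.randn(B, T, action_dim, requires_grad=True)
# Placeholder ground truth (assumed continuous actions).
actions_gt = torch.randn(B, T, action_dim)

packed_actions = pack_padded_sequence(
    actions, lens, batch_first=True, enforce_sorted=False
)
packed_actions_gt = pack_padded_sequence(
    actions_gt, lens, batch_first=True, enforce_sorted=False
)

# Same lengths give the same ordering, so the rows line up one to one.
assert (packed_actions.sorted_indices == packed_actions_gt.sorted_indices).all()

# Only the 9 real time steps contribute to the loss; padding is excluded.
loss = F.mse_loss(packed_actions.data, packed_actions_gt.data)
loss.backward()
```

After backward(), the gradient at the padded positions of actions is zero, since those time steps never entered the packed data.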

My questions are: 1) Are these valid operations? 2) Will there be a performance issue (longer run time) from the repeated packing and padding back and forth on multiple GPUs? Right now my batched input lives on the GPU devices and these hacks happen there. Thank you for your time!

  1. I think your general approach is correct. As an alternative to padding, you could also try to create batches in which all sequences have the same image_seq_lens, but depending on your data distribution this might not be easy.

  2. You should see a slightly lower performance for the packing and unpacking, but if there is no way around this, you wouldn’t have a baseline to compare against.
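
The same-length batching idea from point 1 could be sketched like this. The helper name bucket_by_length is hypothetical, and the sketch does no shuffling, which a real training loop would likely want.

```python
from collections import defaultdict


def bucket_by_length(seq_lens, batch_size):
    """Group sample indices so every batch holds sequences of one length.

    Hypothetical helper: seq_lens[i] is the length of sample i.
    """
    buckets = defaultdict(list)
    for idx, length in enumerate(seq_lens):
        buckets[length].append(idx)

    batches = []
    for length in sorted(buckets):
        idxs = buckets[length]
        # Split each bucket into chunks of at most batch_size samples.
        for start in range(0, len(idxs), batch_size):
            batches.append(idxs[start:start + batch_size])
    return batches


seq_lens = [3, 5, 3, 4, 5, 3, 3]
batches = bucket_by_length(seq_lens, batch_size=2)
print(batches)  # [[0, 2], [5, 6], [3], [1, 4]]
```

A list of index batches like this can be handed to a DataLoader through its batch_sampler argument. Rare lengths produce small batches (the single-sample batch above), which is the data-distribution caveat mentioned in point 1.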

One thing I’m a bit worried about is the usage of the .data attribute.
You’ve mentioned that you are using packed_actions.data and packed_actions_gt.data to compute the loss, which would detach these tensors (especially the model output) and you should see an error.
Could you remove the .data usage and try to calculate the loss directly using these tensors?

But packed_actions and packed_actions_gt are PackedSequence objects. How can I compute a loss on them?

Right now there is no syntax error, and the model seems to be able to train a little bit. I originally thought that .data only accesses the data field of the PackedSequence, and that only an explicit detach will remove the tensor from the graph.

If .data does detach the wrapped tensor, my approach outlined above will not work. Is there a way to access the tensors that a PackedSequence holds while leaving them in the computation graph?

Thank you!

Yeah, I think you are right and the .data attribute is used in the PackedSequence, so skip my comment on the usage of .data in this context.
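
The difference between the two uses of .data can be checked with a small experiment. This is a sketch with hypothetical shapes: on a PackedSequence, data is simply a named-tuple field holding the packed tensor, whereas on a Tensor, .data returns a tensor that is cut off from autograd.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

x = torch.randn(2, 3, 4, requires_grad=True)
y = x * 2  # non-leaf tensor, part of the autograd graph

# Lengths are already sorted in decreasing order, as the default requires.
packed = pack_padded_sequence(y, torch.tensor([3, 2]), batch_first=True)

# PackedSequence.data: the packed tensor itself, still attached to the graph.
packed_field = packed.data
print(packed_field.requires_grad, packed_field.grad_fn is not None)  # True True

# Tensor.data: shares storage with y but is detached from the graph.
tensor_attr = y.data
print(tensor_attr.requires_grad, tensor_attr.grad_fn is None)  # False True
```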

Hi @yanweiw, could you share the full code you use to detect the shorter sequences, pad them, and then unpack them at the LSTM end?


The idea behind this is to avoid useless convolutions on padded frames, right? How do you ensure the convolutions are done correctly on the