Variable-length sequences of variable-length inputs

Thanks for this great framework and the amazing support in this forum.

I have a question about a slightly more complicated use of variable-length inputs. Since I work on video understanding, I will use video as an example to explain what I would like to do.

Basically, I would like to develop a framework that works with variable-length sequences of variable-length inputs.

# num_frames: the number of frames changes per video
# num_features: each video frame has a varying number of features
# feature_dim: the feature dimension is fixed
video: (num_frames, num_features, feature_dim)

If the batch size is 4 (corresponding to 4 videos), then within each video I can first pad the number of features so that it is consistent across the entire video sequence.

# each video has a different length, and each video frame has a *padded* number of features
video_1: (9, 50, 512)
video_2: (10, 30, 512)
video_3: (5, 20, 512)
video_4: (7, 70, 512)
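The first level of padding above can be done per video with `torch.nn.utils.rnn.pad_sequence`. A minimal sketch with made-up frame sizes (the tensors and counts here are hypothetical, just to show the shapes):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

feature_dim = 512

# One video with 3 frames holding 5, 2, and 4 features each.
frames = [torch.randn(5, feature_dim),
          torch.randn(2, feature_dim),
          torch.randn(4, feature_dim)]

# Pads the feature axis to the max count (5):
# (num_frames, max_num_features, feature_dim)
video = pad_sequence(frames, batch_first=True)
print(video.shape)  # torch.Size([3, 5, 512])
```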

Once I have the batch, I can pad the number of features again across videos, and then pad the lengths of the videos (as below).
Since I have two levels of variable-length inputs, I wonder whether the usual strategy of padding the tensors, recording their corresponding lengths, sorting them, and using pack_padded_sequence still makes sense. A simple model could be an FC layer (512->256) shared across all video frames, followed by max pooling, and finally an RNN run over the video sequence.

# 2-layer padding        # shared FC layer + MaxPooling   # RNNs
video_1: (10, 70, 512)  ------> (10, 1, 256)    -------> (1, 1, 256)
video_2: (10, 70, 512)  ------> (10, 1, 256)    -------> (1, 1, 256)
video_3: (10, 70, 512)  ------> (10, 1, 256)    -------> (1, 1, 256)
video_4: (10, 70, 512)  ------> (10, 1, 256)    -------> (1, 1, 256)
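The pipeline above could be sketched as follows. This is only an illustration under my own assumptions (a GRU for the RNN, batch-first layout, lengths pre-sorted in decreasing order as the default `pack_padded_sequence` requires); note also that max pooling over a padded feature axis can pick up padding zeros if all real activations are negative, so masking the padded positions may be needed in practice:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

fc = nn.Linear(512, 256)             # shared FC layer across all frames
rnn = nn.GRU(256, 256, batch_first=True)

# Batch padded on both levels: (batch, max_frames, max_features, feature_dim)
batch = torch.randn(4, 10, 70, 512)
video_lengths = torch.tensor([10, 9, 7, 5])  # true frame counts, sorted descending

h = fc(batch)                        # FC applied per feature: (4, 10, 70, 256)
h = h.max(dim=2).values              # max-pool over the feature axis: (4, 10, 256)

packed = pack_padded_sequence(h, video_lengths, batch_first=True)
out, hidden = rnn(packed)
print(hidden.shape)  # torch.Size([1, 4, 256])
```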

While these operations seem straightforward, I am a little worried about whether the gradients will be computed correctly, since the model involves two levels of variable-length inputs.

I have read all the topics in this forum about variable-length inputs, but previous questions focus on only one level of variable length (variable-length sentences for NMT, or variable-length videos), whereas I have two levels.

I am halfway through writing the data-loading code and constructing my model, but I would really appreciate it if anyone with suggestions or previous experience could share them with me.


pack_padded_sequence only applies to RNNs (nn.RNN, nn.LSTM, nn.GRU).

You can process the initial stage (shared FC layer + MaxPooling) without padding the feature dimension, and then, once you have pooled, pad the output and feed it into the RNN as a packed sequence.
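A minimal sketch of that suggestion, with hypothetical data (the `encode_video` helper and the GRU choice are mine, not from the original post). Each frame's unpadded features are pooled first, so only the time axis ever needs padding; `enforce_sorted=False` (available in PyTorch 1.1+) avoids having to sort the videos by length:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

fc = nn.Linear(512, 256)             # shared FC layer across all frames
rnn = nn.GRU(256, 256, batch_first=True)

def encode_video(frames):
    # Pool each frame over its *unpadded* features -> (num_frames, 256)
    return torch.stack([fc(f).max(dim=0).values for f in frames])

# 4 videos with 9, 10, 5, 7 frames; each frame has a random feature count.
videos = [
    [torch.randn(torch.randint(5, 50, (1,)).item(), 512) for _ in range(n)]
    for n in (9, 10, 5, 7)
]

encoded = [encode_video(v) for v in videos]           # per-video (T_i, 256)
lengths = torch.tensor([e.size(0) for e in encoded])  # [9, 10, 5, 7]

padded = pad_sequence(encoded, batch_first=True)      # (4, 10, 256)
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
out, hidden = rnn(packed)
print(hidden.shape)  # torch.Size([1, 4, 256])
```

Because pooling happens before any padding, no padded feature can ever win the max, which sidesteps the masking concern entirely.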