Thanks for this great framework and the amazing support in this forum.
I have a question regarding a slightly more complicated use of variable-length inputs. Since I work on video understanding, I will use video as an example to explain what I would like to do.
Basically, I would like to develop a framework that works with variable-length sequences of variable-length inputs.
```
# num_frames: number of frames varies per video
# num_features: each video frame has a varying number of features
# feature_dim: feature dimension is fixed
video: (num_frames, num_features, feature_dim)
```
If the batch size is 4 (corresponding to 4 videos), then within each video I can first pad the number of features so that it is consistent across all frames of that video.
```
# each video has a different length, and each frame has a *padded* number of features
video_1: (9, 50, 512)
video_2: (10, 30, 512)
video_3: (5, 20, 512)
video_4: (7, 70, 512)
```
Once I have the batch, I can again pad the number of features across the videos, and then pad the video lengths (as below).
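To make the two-level padding concrete, here is a minimal sketch of how I imagine it, assuming each frame arrives as a `(num_features, feature_dim)` tensor (the random feature counts are just placeholder data):

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

feature_dim = 512
num_frames = [9, 10, 5, 7]  # as in the example above

# Hypothetical raw data: each frame has a random number of features.
videos = [
    [torch.randn(torch.randint(5, 71, ()).item(), feature_dim) for _ in range(n)]
    for n in num_frames
]

# Level 1: pad num_features within each video -> (num_frames_i, max_feat_i, D)
padded = [pad_sequence(frames, batch_first=True) for frames in videos]

# Level 2: pad_sequence needs matching trailing dims, so first pad
# num_features up to the batch-wide maximum, then pad num_frames.
max_feat = max(v.shape[1] for v in padded)
padded = [F.pad(v, (0, 0, 0, max_feat - v.shape[1])) for v in padded]
batch = pad_sequence(padded, batch_first=True)  # (4, 10, max_feat, 512)

# Keep the true lengths around for masking / pack_padded_sequence later.
frame_lens = torch.tensor([len(v) for v in videos])    # tensor([9, 10, 5, 7])
feat_lens = [[f.shape[0] for f in v] for v in videos]  # per-frame feature counts
```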
Since I have two levels of variable-length inputs, I wonder whether the usual strategy of padding the tensors, recording their lengths, sorting them, and using pack_padded_sequence still makes sense. A simple model could be an FC layer (512 -> 256) shared across all video frames, followed by max pooling over the features, and finally an RNN run over the video sequence.
```
# 2-layer padding       # shared FC layer + MaxPooling   # RNNs
video_1: (10, 70, 512) ------> (10, 1, 256) -------> (1, 1, 256)
video_2: (10, 70, 512) ------> (10, 1, 256) -------> (1, 1, 256)
video_3: (10, 70, 512) ------> (10, 1, 256) -------> (1, 1, 256)
video_4: (10, 70, 512) ------> (10, 1, 256) -------> (1, 1, 256)
```
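The pipeline above could be sketched roughly as follows. The module name, the choice of GRU, and the masked max pooling are my own illustrative assumptions, not anything fixed; `feat_mask` marks the real (non-padded) feature positions so that padding cannot win the max, and `enforce_sorted=False` lets pack_padded_sequence handle unsorted lengths:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class VideoEncoder(nn.Module):
    """Sketch: shared FC (512 -> 256), masked max pool over features, GRU over frames."""
    def __init__(self, feature_dim=512, hidden=256):
        super().__init__()
        self.fc = nn.Linear(feature_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, batch, frame_lens, feat_mask):
        # batch: (B, T, F, D), padded on both T (frames) and F (features)
        # feat_mask: (B, T, F) bool, True at real feature positions
        h = torch.relu(self.fc(batch))                       # (B, T, F, H)
        h = h.masked_fill(~feat_mask.unsqueeze(-1), float('-inf'))
        h = h.max(dim=2).values                              # (B, T, H)
        h = h.masked_fill(torch.isinf(h), 0.0)               # fully padded frames -> 0
        packed = pack_padded_sequence(h, frame_lens.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, last = self.rnn(packed)                           # (1, B, H)
        return last.squeeze(0)                               # (B, H)

# Toy usage matching the shapes above.
B, T, Feat, D = 4, 10, 70, 512
batch = torch.randn(B, T, Feat, D)
frame_lens = torch.tensor([9, 10, 5, 7])
feat_mask = torch.zeros(B, T, Feat, dtype=torch.bool)
for b, (tl, fl) in enumerate(zip(frame_lens, [50, 30, 20, 70])):
    feat_mask[b, :tl, :fl] = True
out = VideoEncoder()(batch, frame_lens, feat_mask)           # (4, 256)
```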
While these operations seem straightforward, I am a bit worried about whether the gradients will be computed correctly, since the setup involves two levels of variable-length inputs.
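One small check I can think of (under my assumption that the pooling is masked): with a masked max, autograd should route gradients only through real positions, so the padded slots receive exactly zero gradient. A toy version:

```python
import torch

# Two "frames", each with 5 feature slots; frame 0 has 2 padded slots.
x = torch.randn(2, 5, requires_grad=True)
mask = torch.tensor([[True, True, True, False, False],
                     [True, True, True, True, True]])

# Masked max: padded slots are set to -inf and can never be the argmax.
y = x.masked_fill(~mask, float('-inf')).max(dim=1).values.sum()
y.backward()

assert x.grad[0, 3:].abs().sum() == 0  # padded slots got zero gradient
```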
I have read all the topics in this forum about variable-length inputs, but previous questions only cover a single level of variable length (sentences of varying length for NMT, or videos of varying length), whereas I have two levels of variable-length inputs.
I am halfway through writing the data-loading code and constructing my model, but I would really appreciate any suggestions or prior experience anyone can share.