What is the use of returning lengths?
I think it is best understood with an example. Suppose we need to classify whether a video shows a teaching class or a person practicing a sport. Videos are composed of a sequence of frames, each of which is an image. So one way to go is to pass the sequence of frames through an LSTM model. Now suppose every frame of our video has shape torch.Size([C, H, W])
, where C is the number of RGB channels, H is the height and W is the width of the image. We also have a set of videos, and every video might have a different length, and therefore a different number of total frames. For example, video1 has 5350 frames while video2 has 3323 frames. You can model video1 and video2 with the following tensors: torch.Size([5350, C, H, W])
and torch.Size([3323, C, H, W])
respectively. As you can see, the two tensors have different sizes in their first dimension, which prevents us from stacking them into a single tensor. To make this possible, we can save a tensor lengths = [5350, 3323]
and then pad all video tensors with zeros so that they have equal length, i.e., both have the length of the longest video, which is 5350, resulting in two tensors of shape torch.Size([5350, C, H, W])
. After that, we can stack both tensors to obtain a single tensor of shape torch.Size([2, 5350, C, H, W])
, where 2 is the batch_size
(you can stack them with torch.stack). But, as you can see, stacking loses the information about where each sequence really ends: for the tensor of video2, all entries of video2_tensor[3323:, ...]
will be zeros. To remedy this, we need the lengths
vector, so that downstream code can recover the original sequence and not process a bunch of padding zeros.
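The whole procedure above can be sketched as follows. This is a toy version with 5 and 3 frames instead of 5350 and 3323, and the variable names (video1, video2, lengths) are just illustrative; pack_padded_sequence then uses lengths so an RNN would skip the padded frames:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Toy stand-ins for the two videos: 5 and 3 frames of shape (C, H, W)
C, H, W = 3, 4, 4
video1 = torch.randn(5, C, H, W)
video2 = torch.randn(3, C, H, W)

lengths = torch.tensor([5, 3])   # number of real frames per video
max_len = int(lengths.max())

# Zero-pad video2 along the time dimension so both have max_len frames
pad = torch.zeros(max_len - video2.shape[0], C, H, W)
video2_padded = torch.cat([video2, pad], dim=0)

# Stack into a single batch: torch.Size([2, 5, C, H, W])
batch = torch.stack([video1, video2_padded])

# Flatten each frame to a feature vector so an LSTM could consume it,
# then pack the batch together with lengths so the padded (all-zero)
# frames are skipped during the recurrence
features = batch.flatten(start_dim=2)   # (2, 5, C*H*W)
packed = pack_padded_sequence(features, lengths, batch_first=True,
                              enforce_sorted=True)
```

Note that enforce_sorted=True requires the sequences to be ordered from longest to shortest, as they are here; otherwise pass enforce_sorted=False.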
Also, how would you use the built-in torch.nn.utils.rnn.pad_sequence
in your example?
Yes! You could use it, and your code seems fine to me. But why the mask = (batch != 0).to(device)
line?
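For reference, a minimal sketch of how pad_sequence could replace the manual padding and stacking in one call (toy sizes again). It also shows a mask built from lengths rather than from (batch != 0): comparing values to zero would wrongly mask any real frame that happens to be all zeros, whereas lengths tells us exactly which time steps are padding:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

C, H, W = 3, 4, 4
videos = [torch.randn(5, C, H, W), torch.randn(3, C, H, W)]
lengths = torch.tensor([v.shape[0] for v in videos])

# pad_sequence zero-pads every video to the longest one and stacks them:
# result has shape (2, 5, C, H, W) with batch_first=True
batch = pad_sequence(videos, batch_first=True)

# Boolean mask of real (non-padded) time steps, shape (2, 5),
# derived from lengths instead of from the tensor values
mask = torch.arange(batch.shape[1])[None, :] < lengths[:, None]
```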