What is the use of returning lengths?
I think it can be better understood with an example. Suppose we have a problem where we need to classify whether a video shows a teaching class or a person practicing a sport. A video is a sequence of frames, each of which is an image, so one way to go is to pass the sequence of frames through an LSTM model. Now suppose every frame of our video has shape
torch.Size([C, H, W]), where C is the number of RGB channels, H is the height, and W is the width of the image. We also have a set of videos, and every video might have a different length, i.e., a different number of frames. For example, video1 has 5350 frames while video2 has 3323 frames. You can model video1 and video2 with the following tensors:
torch.Size([5350, C, H, W]) and
torch.Size([3323, C, H, W]) respectively. As you can see, the tensors differ in their first dimension, which prevents us from stacking them into a single tensor. To make this possible, we can save a tensor called
lengths = [5350, 3323] and then zero-pad all video tensors so they have equal length, i.e., both have the length of the longest video, which is 5350 frames, resulting in two tensors of shape
torch.Size([5350, C, H, W]). After that, we can stack both tensors to obtain a single tensor of shape
torch.Size([2, 5350, C, H, W]), where 2 is the
batch_size (you can stack them with this function). But, as you can see, we have lost the length information when stacking both tensors, which means that for the tensor of video2, all frames in
video2_tensor[3323:, ...] will be zero. To remedy this, we need to use the
lengths vector to get the original sequence back, and not a bunch of zeros.
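A minimal sketch of the padding-and-stacking described above (with toy sizes, 5 and 3 frames instead of 5350 and 3323, so it runs quickly):

```python
import torch

# Toy stand-ins for the two videos: 5 and 3 frames, tiny C, H, W.
C, H, W = 3, 4, 4
video1 = torch.randn(5, C, H, W)
video2 = torch.randn(3, C, H, W)

# Save the original lengths before padding.
lengths = torch.tensor([video1.size(0), video2.size(0)])  # tensor([5, 3])
max_len = int(lengths.max())

# Zero-pad video2 along the time dimension so both have max_len frames.
pad = torch.zeros(max_len - video2.size(0), C, H, W)
video2_padded = torch.cat([video2, pad], dim=0)

# Now both tensors have the same shape and can be stacked into one batch.
batch = torch.stack([video1, video2_padded])  # shape [2, max_len, C, H, W]

# Later, `lengths` lets us slice out only the real frames of video2,
# skipping the zero padding at the end.
real_frames = batch[1, : lengths[1]]          # shape [3, C, H, W]
```

The key point is that the padding itself is unrecoverable from the batch alone (a real frame could in principle be all zeros too), which is why the lengths must travel alongside the batch.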
Also, how would you use the built-in
torch.nn.utils.rnn.pad_sequence in your example?
Yes! You could use it, and your code seems fine to me. But why the
mask = (batch != 0).to(device) line?
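For reference, here is how pad_sequence could replace the manual padding in the example above (same toy sizes as before; pad_sequence pads with zeros by default, so the lengths still need to be kept separately):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

C, H, W = 3, 4, 4
video1 = torch.randn(5, C, H, W)
video2 = torch.randn(3, C, H, W)

# pad_sequence zero-pads along dim 0 and stacks; batch_first=True
# gives shape [batch, max_len, C, H, W].
batch = pad_sequence([video1, video2], batch_first=True)

# The lengths are not stored in `batch`, so keep them explicitly.
lengths = torch.tensor([v.size(0) for v in (video1, video2)])
```

This does in one call what the manual torch.cat / torch.stack padding did above, but the design choice is the same: the batch alone cannot tell padding apart from genuine zero-valued frames, so the lengths vector remains the source of truth.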