I have some video clips with different lengths. The clip lengths are stored in seq_len = [T_1, T_2, …, T_n], and the clips are concatenated along the time axis into a tensor of shape [\sum T_i, C, H, W].
Now I want to convert it into shape [B, C, \max T_i, H, W], i.e., zero-pad the shorter clips up to \max T_i and stack them into a batch that can be fed to nn.Conv3d.
Here is an implementation using loops:
```python
import torch
import torch.nn.functional as F

def pack_x(x, seq_len):
    # x: [sum(T_i), C, H, W] -> returns [B, C, max(T_i), H, W]
    T = seq_len.max().item()
    x = x.transpose(0, 1)                            # [C, sum(T_i), H, W]
    clips = torch.split(x, seq_len.tolist(), dim=1)  # B tensors of [C, T_i, H, W]
    pack_x = []
    for clip in clips:
        # zero-pad the time dimension (dim 1) up to T
        clip = F.pad(clip, [0, 0, 0, 0, 0, T - clip.size(1)])
        pack_x.append(clip.unsqueeze(0))             # [1, C, T, H, W]
    return torch.cat(pack_x)                         # [B, C, T, H, W]

seq_len = torch.randint(5, 10, size=(4,))  # e.g. 4 clips of random length
inputs = torch.randn(seq_len.sum(), 512, 28, 28)
y = pack_x(inputs, seq_len)
```
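For reference, the same packing can also be done without the explicit padding loop via `torch.nn.utils.rnn.pad_sequence`, which zero-pads a list of variable-length tensors along their first dimension. This is a sketch under the same input layout as above; the function name `pack_x_pad_sequence` and the example clip lengths are illustrative:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pack_x_pad_sequence(x, seq_len):
    # x: [sum(T_i), C, H, W]
    clips = torch.split(x, seq_len.tolist(), dim=0)  # B tensors of [T_i, C, H, W]
    padded = pad_sequence(clips, batch_first=True)   # [B, max(T_i), C, H, W], zero-padded
    return padded.permute(0, 2, 1, 3, 4)             # [B, C, max(T_i), H, W]

seq_len = torch.tensor([5, 7, 6])  # hypothetical clip lengths
inputs = torch.randn(int(seq_len.sum()), 512, 28, 28)
y = pack_x_pad_sequence(inputs, seq_len)
print(y.shape)  # torch.Size([3, 512, 7, 28, 28])
```

Because the split happens on dim 0 here, no initial transpose is needed; a single `permute` at the end moves the time axis behind the channel axis for `nn.Conv3d`.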