PytorchVideo - AvgPool3d has reduced the sequence length (From 16 to 13)

Hi, I’m working with the new PyTorchVideo library and I’m using ResNet3D.
For a batch size of 1, my input is torch.Size([1, 3, 16, 224, 224]). I was expecting torch.Size([1, 2048, 16, 1, 1]) after the average pool layer, but I got torch.Size([1, 2048, 13, 1, 1]). I can’t figure out how the sequence length changed from 16 to 13.

Any help?
I verified that the input to that layer is torch.Size([1, 2048, 16, 7, 7]), which is what I expected, and the head looks like this:

(5): ResNetBasicHead(
        (pool): AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
        (dropout): Dropout(p=0.5, inplace=False)
        (proj): Linear(in_features=2048, out_features=25, bias=True)
        (output_pool): AdaptiveAvgPool3d(output_size=1)
      )
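
(For reference, this is roughly how I checked the shape going into the head, using a forward pre-hook. It is only a sketch: the torch.hub entry point and the blocks[5] index are assumptions based on the printout above, not my actual training code.)

import torch

# assumption: a PyTorchVideo ResNet3D loaded via torch.hub as a stand-in for my model
model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=False)

def print_input_shape(module, inputs):
    # inputs is the tuple of positional arguments passed to the module
    print(type(module).__name__, "input shape:", inputs[0].shape)

# assumed: the head is the last block (index 5, as in the printout above)
handle = model.blocks[5].register_forward_pre_hook(print_input_shape)
model(torch.randn(1, 3, 16, 224, 224))
handle.remove()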

I’m not sure which pooling layer you are referring to, but neither would change the number of channels.
Also, your current module would raise a shape mismatch, so I assume the forward method reshapes the activation etc.
Using the layers manually yields the expected shapes:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 16, 224, 224)
pool = nn.AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
out = pool(x)
print(out.shape)
> torch.Size([1, 3, 13, 218, 218])

lin = nn.Linear(in_features=2048, out_features=25, bias=True)
out = lin(out)
> RuntimeError: mat1 and mat2 shapes cannot be multiplied (8502x218 and 2048x25)

adaptive_pool = nn.AdaptiveAvgPool3d(output_size=1)
out = adaptive_pool(x)
print(out.shape)
> torch.Size([1, 3, 1, 1, 1])
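
So my guess is that the forward of ResNetBasicHead permutes the activation so that the channel dimension comes last before the proj Linear is applied, roughly along these lines (just a sketch of that assumption, not the actual PyTorchVideo code; dropout is omitted):

import torch
import torch.nn as nn

x = torch.randn(1, 2048, 16, 7, 7)   # the activation entering the head
pool = nn.AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
proj = nn.Linear(in_features=2048, out_features=25, bias=True)
output_pool = nn.AdaptiveAvgPool3d(output_size=1)

out = pool(x)                     # torch.Size([1, 2048, 13, 1, 1])
out = out.permute(0, 2, 3, 4, 1)  # channels last: torch.Size([1, 13, 1, 1, 2048])
out = proj(out)                   # Linear acts on the last dim: torch.Size([1, 13, 1, 1, 25])
out = out.permute(0, 4, 1, 2, 3)  # back to torch.Size([1, 25, 13, 1, 1])
out = output_pool(out)
print(out.shape)
> torch.Size([1, 25, 1, 1, 1])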

(1, 3, 16, 224, 224) is the input to the model. For the average pool, I have:

import torch
import torch.nn as nn

x = torch.randn(1, 2048, 16, 7, 7)
pool = nn.AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
out = pool(x)
print(out.shape)
> torch.Size([1, 2048, 13, 1, 1])

I was expecting the same sequence length of 16, not 13. My understanding of pooling (2D or 3D) was that it only works on the spatial dimensions.

No, that’s not the case, and it’s exactly the difference between the 2D and 3D layers.
While the 2D layers apply the kernel to the H and W dimensions of an input of shape [N, C, H, W], the 3D layers use a 3D kernel applied to the D, H, and W dimensions of an input of shape [N, C, D, H, W].
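
You can see it directly by comparing the two (a small sketch with made-up tensors); the temporal size follows the usual pooling formula D_out = floor((D_in + 2 * padding - kernel_size) / stride) + 1 = (16 + 0 - 4) / 1 + 1 = 13 in your setup:

import torch
import torch.nn as nn

x2d = torch.randn(1, 2048, 7, 7)        # [N, C, H, W]
x3d = torch.randn(1, 2048, 16, 7, 7)    # [N, C, D, H, W]

pool2d = nn.AvgPool2d(kernel_size=(7, 7))                        # pools H and W only
pool3d = nn.AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1))   # pools D, H, and W

print(pool2d(x2d).shape)  # torch.Size([1, 2048, 1, 1])
print(pool3d(x3d).shape)  # torch.Size([1, 2048, 13, 1, 1]) -> D = (16 - 4) // 1 + 1 = 13

If you want to keep the temporal length of 16 at this point, a kernel size of (1, 7, 7) would pool only over the spatial dimensions.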