PytorchVideo - AvgPool3d has reduced the sequence length (From 16 to 13)

Hi, I’m working with the new PyTorchVideo library and I’m using ResNet3D.
For a batch size of 1, my input is torch.Size([1, 3, 16, 224, 224]). I was expecting torch.Size([1, 2048, 16, 1, 1]) after the average pool layer, but I got torch.Size([1, 2048, 13, 1, 1]). I can’t figure out how the sequence length changed from 16 to 13.

Any help?
I verified that the input to that layer is torch.Size([1, 2048, 16, 7, 7]), which is what I expected, and the head looks like this:

(5): ResNetBasicHead(
        (pool): AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
        (dropout): Dropout(p=0.5, inplace=False)
        (proj): Linear(in_features=2048, out_features=25, bias=True)
        (output_pool): AdaptiveAvgPool3d(output_size=1)
      )
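
(For reference, this is roughly how I checked the shape going into the head, using a forward pre-hook. It is only a sketch: the torch.hub entry point and the blocks[5] index are assumptions based on the printout above, not my actual training code.)

import torch

# assumption: a PyTorchVideo ResNet3D loaded via torch.hub as a stand-in for my model
model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=False)

def print_input_shape(module, inputs):
    # inputs is the tuple of positional arguments passed to the module
    print(type(module).__name__, "input shape:", inputs[0].shape)

# assumed: the head is the last block (index 5, as in the printout above)
handle = model.blocks[5].register_forward_pre_hook(print_input_shape)
model(torch.randn(1, 3, 16, 224, 224))
handle.remove()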

I’m not sure which pooling layer you are referring to, but neither would change the number of channels.
Also, your current module would raise a shape mismatch, so I assume the forward method reshapes the activation etc.
Using the layers manually yields the expected shapes:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 16, 224, 224)
pool = nn.AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
out = pool(x)
print(out.shape)
> torch.Size([1, 3, 13, 218, 218])

lin = nn.Linear(in_features=2048, out_features=25, bias=True)
out = lin(out)
> RuntimeError: mat1 and mat2 shapes cannot be multiplied (8502x218 and 2048x25)

adaptive_pool = nn.AdaptiveAvgPool3d(output_size=1)
out = adaptive_pool(x)
print(out.shape)
> torch.Size([1, 3, 1, 1, 1])
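
So my guess is that the forward of ResNetBasicHead permutes the activation so that the channel dimension comes last before the proj Linear is applied, roughly along these lines (just a sketch of that assumption, not the actual PyTorchVideo code; dropout is omitted):

import torch
import torch.nn as nn

x = torch.randn(1, 2048, 16, 7, 7)   # the activation entering the head
pool = nn.AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
proj = nn.Linear(in_features=2048, out_features=25, bias=True)
output_pool = nn.AdaptiveAvgPool3d(output_size=1)

out = pool(x)                     # torch.Size([1, 2048, 13, 1, 1])
out = out.permute(0, 2, 3, 4, 1)  # channels last: torch.Size([1, 13, 1, 1, 2048])
out = proj(out)                   # Linear acts on the last dim: torch.Size([1, 13, 1, 1, 25])
out = out.permute(0, 4, 1, 2, 3)  # back to torch.Size([1, 25, 13, 1, 1])
out = output_pool(out)
print(out.shape)
> torch.Size([1, 25, 1, 1, 1])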

(1, 3, 16, 224, 224) is the input to the model. For the average pool, I have:

import torch
import torch.nn as nn

x = torch.randn(1, 2048, 16, 7, 7)
pool = nn.AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
out = pool(x)
print(out.shape)
> torch.Size([1, 2048, 13, 1, 1])

I was expecting the same sequence length of 16, not 13. My understanding of pooling (2D or 3D) was that it only works on the spatial dimensions.

No, that’s not the case, and it’s exactly the difference between the 2D and 3D layers.
While the 2D layers apply the kernel to the H and W dimensions of an input of shape [N, C, H, W], the 3D layers use a 3D kernel applied to the D, H, and W dimensions of an input of shape [N, C, D, H, W].
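
You can see it directly by comparing the two (a small sketch with made-up tensors); the temporal size follows the usual pooling formula D_out = floor((D_in + 2 * padding - kernel_size) / stride) + 1 = (16 + 0 - 4) / 1 + 1 = 13 in your setup:

import torch
import torch.nn as nn

x2d = torch.randn(1, 2048, 7, 7)        # [N, C, H, W]
x3d = torch.randn(1, 2048, 16, 7, 7)    # [N, C, D, H, W]

pool2d = nn.AvgPool2d(kernel_size=(7, 7))                        # pools H and W only
pool3d = nn.AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1))   # pools D, H, and W

print(pool2d(x2d).shape)  # torch.Size([1, 2048, 1, 1])
print(pool3d(x3d).shape)  # torch.Size([1, 2048, 13, 1, 1]) -> D = (16 - 4) // 1 + 1 = 13

If you want to keep the temporal length of 16 at this point, a kernel size of (1, 7, 7) would pool only over the spatial dimensions.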