Hi, I’m working with the new library PyTorchVideo, and I’m using ResNet3D.
My input for a batch size of 1 has shape torch.Size([1, 3, 16, 224, 224]). After the average pool layer I expected torch.Size([1, 2048, 16, 1, 1]), but I got torch.Size([1, 2048, 13, 1, 1]). I can’t figure out how the sequence length changed from 16 to 13.

Any help?
I verified that the input to that layer is torch.Size([1, 2048, 16, 7, 7]), which is expected.

I’m not sure which pooling layer you are referring to, but neither would change the number of channels.
Also, your current module would raise a shape mismatch, so I assume the forward method reshapes the activation etc.
Using the layers manually yields the expected shapes:

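import torch
import torch.nn as nn

x = torch.randn(1, 3, 16, 224, 224)

# AvgPool3d with a (4, 7, 7) kernel and stride 1 shrinks D, H and W
pool = nn.AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
out = pool(x)
print(out.shape)
> torch.Size([1, 3, 13, 218, 218])

# The linear layer expects 2048 features in the last dimension, so this call fails
lin = nn.Linear(in_features=2048, out_features=25, bias=True)
out = lin(out)
> RuntimeError: mat1 and mat2 shapes cannot be multiplied (8502x218 and 2048x25)

# Adaptive pooling with output_size=1 collapses D, H and W to 1
adaptive_pool = nn.AdaptiveAvgPool3d(output_size=1)
out = adaptive_pool(x)
print(out.shape)
> torch.Size([1, 3, 1, 1, 1])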
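
For reference, here is a rough sketch of the kind of reshape I assume the forward method performs before the linear layer. The input shape is taken from the question; the flatten step is my assumption and not necessarily what the actual model does:

```python
import torch
import torch.nn as nn

# Activation shape reported in the question: [N, C, D, H, W]
feat = torch.randn(1, 2048, 16, 7, 7)

adaptive_pool = nn.AdaptiveAvgPool3d(output_size=1)
lin = nn.Linear(in_features=2048, out_features=25, bias=True)

pooled = adaptive_pool(feat)      # D, H and W are each reduced to 1
flat = torch.flatten(pooled, 1)   # assumed reshape: keep N, merge the rest
logits = lin(flat)                # last dimension is now 2048, so this works

print(pooled.shape, flat.shape, logits.shape)
```

With the channel dimension moved into the last position, the linear layer no longer raises the shape mismatch.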

No, that’s not the case; this is the difference between the 2D and 3D layers.
While the 2D layers apply the kernel on the H and W dimensions for an input of [N, C, H, W], 3D layers use a 3D kernel applied on D, H, and W for an input of [N, C, D, H, W].
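
To make the 16 to 13 change concrete, here is a small sketch that applies the same pooling arithmetic to the temporal dimension. The helper pooled_len is a hypothetical name for illustration; it implements the usual floor-mode output size formula for pooling layers. The (1, 7, 7) kernel at the end is only one possible way to keep all 16 frames and is an assumption on my side, not a setting taken from the model:

```python
import torch
import torch.nn as nn

def pooled_len(size, kernel, stride=1, padding=0):
    # Output length along one dimension for a pooling window (floor mode)
    return (size + 2 * padding - kernel) // stride + 1

# Activation shape reported in the question: [N, C, D, H, W]
feat = torch.randn(1, 2048, 16, 7, 7)

# 3D pooling: the kernel also slides over D, so 16 frames become 16 - 4 + 1
pool3d = nn.AvgPool3d(kernel_size=(4, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
shape3d = tuple(pool3d(feat).shape)

# 2D pooling on a single frame [N, C, H, W]: only H and W are reduced
pool2d = nn.AvgPool2d(kernel_size=7, stride=1)
shape2d = tuple(pool2d(feat[:, :, 0]).shape)

# A temporal kernel size of 1 leaves the D dimension untouched
keep_d = nn.AvgPool3d(kernel_size=(1, 7, 7), stride=(1, 1, 1))
shape_keep = tuple(keep_d(feat).shape)

expected_d = pooled_len(16, kernel=4)
print(shape3d, shape2d, shape_keep, expected_d)
```

So the 13 comes from the temporal kernel size of 4 sliding over 16 frames with stride 1 and no padding.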