What is an efficient way to reduce the dimension of 3d convolution filters?

I have two questions about tensor dimensions.

  1. How do I convert my model output torch.Size([5, 1024, 7, 2, 4]) to torch.Size([5, 1024, 7, 1, 1])? I thought of slicing, but is that a good idea, or is there a better way to do it?

  2. I have two torch.Size([5, 1024, 7, 1, 1]) outputs from the network. I would like to concatenate these two outputs and feed the result to a two-layer fully connected head (layer 1 output: 1024, layer 2 output: 1). What would be the best way to do this?

Thank you.

  1. It looks like you are working with volumetric data, based on the shape. You could apply some pooling, or another conv layer, over the last two dimensions. I don’t think there is a single “best” way; it will most likely depend on your model and use case.

  2. Usually you would flatten the tensors to have the shape [batch_size, in_features]. However, nn.Linear layers also take arbitrary dimensions for inputs such as [batch_size, *, in_features] and apply the linear transformation on each sample in the asterisk dimensions.
    Would you like to apply the linear layers separately on each of the inputs or would you like to concatenate the inputs in the feature dimension?
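One concrete option along the lines of the pooling suggestion is adaptive pooling, which collapses the last two dimensions regardless of their size. A minimal sketch, assuming the [5, 1024, 7, 2, 4] output shape from the question:

```python
import torch
import torch.nn as nn

x = torch.randn(5, 1024, 7, 2, 4)  # example model output [N, C, D, H, W]

# AdaptiveAvgPool3d takes the desired output (D, H, W):
# keep the depth of 7, collapse H and W down to 1 each.
pool = nn.AdaptiveAvgPool3d((7, 1, 1))
y = pool(x)
print(y.shape)  # torch.Size([5, 1024, 7, 1, 1])
```

nn.AdaptiveMaxPool3d works the same way if max pooling fits the model better than averaging.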

  1. Since I don’t want to modify the network, I used torch.max to reduce [5, 1024, 7, 2, 4] to [5, 1024, 7]:
out = torch.max(x, dim=3)[0]    # [5, 1024, 7, 4]
out = torch.max(out, dim=3)[0]  # [5, 1024, 7] (dim=3 again, since the first max removed a dim)

I hope this does not lose any important features from the final layer.
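As a side note, the two chained torch.max calls can also be written as a single reduction over both trailing dims with torch.amax (available since PyTorch 1.7). A small sketch, assuming the same [5, 1024, 7, 2, 4] input:

```python
import torch

x = torch.randn(5, 1024, 7, 2, 4)

# Reduce over dims 3 and 4 in one call.
out = torch.amax(x, dim=(3, 4))  # [5, 1024, 7]

# Equivalent to chaining torch.max twice.
ref = torch.max(torch.max(x, dim=3)[0], dim=3)[0]
assert torch.equal(out, ref)
```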

  1. I would like to concatenate the inputs. Yes, flattening helped me achieve this:
    import torch
    import torch.nn as nn

    class FusionNet(nn.Module):
        def __init__(self, spatial_model, temporal_model):
            super(FusionNet, self).__init__()
            self.spatial_model = spatial_model      # backbone producing [N, 1024, 7]
            self.temporal_model = temporal_model    # backbone producing [N, 1024, 7]
            self.fc = nn.Sequential(
                nn.Linear(1024 * 7 * 2, 1024, bias=True),  # concatenation doubles the channel dim
                nn.ReLU(inplace=True),
                nn.Dropout(p=0.5),
                nn.Linear(1024, 1))

        def forward(self, spatial_input, temporal_input):
            spatial_output = self.spatial_model(spatial_input)     # [N, 1024, 7]
            temporal_output = self.temporal_model(temporal_input)  # [N, 1024, 7]
            fused = torch.cat((spatial_output, temporal_output), dim=1)  # [N, 2048, 7]
            out = fused.view(fused.size(0), -1)  # flatten to [N, 2048 * 7]
            out = self.fc(out)                   # [N, 1]
            return out
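As a quick sanity check on the shapes, assuming both backbones emit [5, 1024, 7] (as in the torch.max step above):

```python
import torch

spatial_output = torch.randn(5, 1024, 7)   # assumed backbone output
temporal_output = torch.randn(5, 1024, 7)  # assumed backbone output

fused = torch.cat((spatial_output, temporal_output), dim=1)  # [5, 2048, 7]
flat = fused.view(fused.size(0), -1)                         # [5, 2048 * 7]
assert flat.size(1) == 1024 * 7 * 2  # matches the in_features of the first nn.Linear
```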

Thanks again