Configuring the input size of the first linear layer in a convolutional neural network is tricky

I have a training dataset of melgrams, each of shape [21, 128]. The training set contains 3200 melgrams. The DataLoader I create with a batch size of 16 yields batches of shape [16, 21, 128], i.e. 16 melgrams per batch.
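For context, the loader is built roughly like this (the random tensor is just a stand-in for my real melgram array; only the shapes matter):

import torch
from torch.utils.data import TensorDataset, DataLoader

melgrams = torch.randn(3200, 21, 128)  #stand-in for the real melgrams
labels = torch.randint(0, 4, (3200,))  #4 classes, matching output_units below

dataset = TensorDataset(melgrams, labels)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

batch, _ = next(iter(loader))
print(batch.shape)  #torch.Size([16, 21, 128])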

I have created the following neural network with PyTorch (the torch module). Briefly, the network has 4 convolutional layers followed by 4 linear layers. The model structure is as follows:

import torch
import torch.nn as nn

class ConvolutionalNeuralNetwork_pooling(nn.Module):
  def __init__(self):
    super(ConvolutionalNeuralNetwork_pooling, self).__init__()
    #initialize features
    self.input_units = 1       #one input channel per melgram
    self.output_units = 4      #number of classes
    self.kernel_size = 5
    self.pool_kernel_size = 2  #halves H and W at each pooling step
    #convolutional layers
    #(in_channels, out_channels, kernel_size)
    self.conv1 = nn.Conv2d(self.input_units, 16, self.kernel_size, padding=2)
    self.conv2 = nn.Conv2d(16, 32, self.kernel_size, padding=2)
    self.conv3 = nn.Conv2d(32, 64, self.kernel_size, padding=2)
    self.conv4 = nn.Conv2d(64, 128, self.kernel_size, padding=2)
    self.fc1 = nn.Linear(128*2*4, 1024) #here is the tricky part
    self.fc2 = nn.Linear(1024, 256)
    self.fc3 = nn.Linear(256, 32)
    self.fc4 = nn.Linear(32, self.output_units)
    #initialize max pooling layer
    self.max_pool = nn.MaxPool2d(kernel_size=self.pool_kernel_size)
    #initialize non-linear activation function
    self.activation = nn.ReLU()
    #initialize weights
    self.apply(self._init_weights)

  def _init_weights(self, module):
    if isinstance(module, nn.Linear):
      module.weight.data.normal_(mean=0.0, std=1.0)
      if module.bias is not None:
        module.bias.data.zero_()
    elif isinstance(module, nn.Conv2d):
      module.weight.data.normal_(mean=0.0, std=1.0)
      if module.bias is not None:
        module.bias.data.zero_()

  def forward(self, x):
    x = x.unsqueeze(1)  #add channel dim: [N, 21, 128] -> [N, 1, 21, 128]
    x = self.max_pool(self.activation(self.conv1(x)))
    x = self.max_pool(self.activation(self.conv2(x)))
    x = self.max_pool(self.activation(self.conv3(x)))
    x = self.max_pool(self.activation(self.conv4(x)))
    x = x.view(x.size(0), -1)  #flatten: [N, 128, 1, 8] -> [N, 1024]
    x = self.activation(self.fc1(x))
    x = self.activation(self.fc2(x))
    x = self.activation(self.fc3(x))
    x = self.fc4(x)
    return x
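(For reference, the shapes I list further down come from pushing a dummy batch through the network and printing x.shape after each layer; a quick check looks like this:)

model = ConvolutionalNeuralNetwork_pooling()
dummy_batch = torch.randn(16, 21, 128)  #same shape as one DataLoader batch
output = model(dummy_batch)
print(output.shape)  #torch.Size([16, 4])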

My question is the following:

If I define the first linear layer as
self.fc1 = nn.Linear(128, 1024)
I receive the following error when the forward pass reaches the first linear layer:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (16x1024 and 128x1024)

The shape of x during training is:

input: torch.Size([16, 1, 21, 128])
after conv1: torch.Size([16, 16, 10, 64])
after conv2: torch.Size([16, 32, 5, 32])
after conv3: torch.Size([16, 64, 2, 16])
after conv4: torch.Size([16, 128, 1, 8])
after flattening (before the 1st linear layer): torch.Size([16, 1024])

However, when I replace

self.fc1 = nn.Linear(128, 1024) with self.fc1 = nn.Linear(128*2*4, 1024)

the shapes can now be multiplied ([16, 1024] by the [1024, 1024] weight matrix of fc1) and the forward pass completes successfully. Is there any general rule that applies here when using max pooling and padding in convolutional networks? It's not clear to me why I should apply this multiplication to the input size of the first linear layer.

Hi Nikos!

The convolutional section of a convolutional neural network has the nice
feature that it can accept input of more-or-less any spatial shape. But, as
you’ve noticed, when the convolutional section is then joined with some
fully-connected layers, the spatial-shape independence can break down.

If, in your use case, the input to your network will always have the same
shape (except for the batch size, which can vary), your solution of modifying
fc1 to self.fc1 = nn.Linear(128*2*4, 1024) is perfectly reasonable.
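The general rule is that in_features of fc1 must equal channels * height * width
of the tensor you flatten. For your input that is 128 * 1 * 8 = 1024 (your
128 * 2 * 4 happens to give the same number). If you don’t want to do the
arithmetic by hand, you can let the network tell you; a minimal sketch,
assuming model = ConvolutionalNeuralNetwork_pooling() from your post:

with torch.no_grad():
    x = torch.randn(1, 21, 128).unsqueeze(1)  #one dummy melgram with channel dim
    #run only the convolutional part of forward()
    for conv in (model.conv1, model.conv2, model.conv3, model.conv4):
        x = model.max_pool(model.activation(conv(x)))
    print(x.shape)    #torch.Size([1, 128, 1, 8])
    print(x.numel())  #1024 -- the in_features you need for fc1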

On the other hand, if you want to preserve the ability of your network to
accept inputs of varying shapes, this modification of fc1 will break again
as soon as you pass input of some other shape into your network.

PyTorch offers adaptive-pooling layers that support the most common
approach to this issue.

In your case, it appears that you want the output of:

x = self.max_pool(self.activation(self.conv4(x)))

to have shape [16, 128, 1, 1] rather than shape [16, 128, 1, 8].

You can achieve this by replacing your last MaxPool2d layer with

self.adapt_pool = nn.AdaptiveMaxPool2d(output_size=1)
x = self.adapt_pool(self.activation(self.conv4(x)))

AdaptiveMaxPool2d(1) adjusts its pooling so that its output has spatial
shape [1, 1], regardless of the spatial shape of its input, and, hence,
regardless of the spatial shape of the input to the network.


K. Frank

Thanks for the reply, Frank. I was wondering why nn.Linear(128*2*4, 1024) is a reasonable input size for the first linear layer and not something like nn.Linear(128*16*4, 1024), for example. Why should I multiply by 2*4?

Thanks again for your suggestions and the solution with adaptive pooling. Good to know there is also this option.