Input dimension of first dense layer in a CNN #2

Hello everybody,
I am trying to implement a CNN for a regression task on audio data. I am using mel-spectrograms of size (64, 64) as features. The network consists of two convolutional layers with max pooling, followed by three fully connected layers.

I am having trouble with the input dimension of the first fully connected layer, which receives the flattened output of the convolutional layers. The target size doesn’t match the input size, and the following warning appears:

UserWarning: Using a target size (torch.Size([2592, 1])) that is different to the input size (torch.Size([1, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)

By chance I figured out that with an input size of 1568 the warning doesn’t appear and the network trains successfully.

I am using a batch size of 162 and one channel. The shapes of X and y in one iteration are:
X.shape = [162, 1, 64, 64]
y.shape = [162, 1]

Here is the code of the network:

import torch.nn as nn
import torch.nn.functional as F


class Basic_CNN(nn.Module):
    def __init__(self):
        super(Basic_CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, (5, 5), padding=1, stride=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(16, 8, (5, 5), padding=1, stride=1)
        self.fc1 = nn.Linear(1568, 120)
        self.fc2 = nn.Linear(120, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, x):
        # Convolutional layer with ReLU and 2x2 pooling
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # flatten output
        x = x.view(-1, 1568)
        # fc layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # output layer
        x = self.fc3(x)
        return x

Could somebody be so kind as to explain how to choose the input size of this specific layer? How does it depend on the output size of the previous layers?

Thank you very much in advance.

P.S.: Sorry about the first, deleted post.


The warning you are getting is raised by your loss function, as the last line of the message shows [return F.mse_loss(input, target, reduction=self.reduction)]. MSE loss expects your prediction and your target to have the same shape. Maybe you can show us a bit of your training loop, so we can help you more.

And now to answer your questions:

Could somebody be so kind as to explain how to choose the input size of this specific layer? How does it depend on the output size of the previous layers?

When you go from a convolution to a linear layer, you need to flatten your learned features, because a Conv2d layer outputs a 4D tensor [B, C, H_out, W_out] while a Linear layer takes a 2D tensor [B, F_in]. The features live in the channel, height and width dimensions, so flattening them must give F_in = C*H_out*W_out.
You can also apply something like adaptive average pooling before you flatten. The height and width of a convolution’s output depend on the height and width of its input, so a different input size can give C*H_out*W_out != F_in, and you would get an error because the flattened input no longer matches the weight shape of your Linear layer. With adaptive average pooling (AdaptiveAvgPool2d) you specify the output height and width yourself, whatever the input size. When you flatten that output, it has the same height and width every time, so C*H_out*W_out = F_in will (hopefully) always be true.
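To make that concrete, here is a small plain-Python sketch (no PyTorch needed) that counts the flattened features for the layer settings of Basic_CNN from the question. The helper names conv_out and flat_features are made up for this example, and the (4, 4) pooling target is an arbitrary choice standing in for what you would pass to nn.AdaptiveAvgPool2d.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output length of a conv or pooling layer along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def flat_features(height, width, adaptive_pool=None):
    """Feature count after conv1 -> pool -> conv2 -> pool -> flatten.

    Mirrors Basic_CNN: 5x5 kernels, padding 1, stride 1, 2x2 max pooling,
    8 channels after conv2. adaptive_pool is a hypothetical (H, W) target
    for an adaptive pooling layer placed before the flatten.
    """
    channels = 8
    for _ in range(2):
        height = conv_out(conv_out(height, 5, padding=1), 2, stride=2)
        width = conv_out(conv_out(width, 5, padding=1), 2, stride=2)
    if adaptive_pool is not None:
        # Adaptive pooling fixes the spatial size regardless of the input.
        height, width = adaptive_pool
    return channels * height * width


print(flat_features(64, 64))                          # 1568
print(flat_features(64, 128))                         # 3360
print(flat_features(64, 64, adaptive_pool=(4, 4)))    # 128
print(flat_features(64, 128, adaptive_pool=(4, 4)))   # 128
```

Without adaptive pooling, a longer spectrogram changes the number of features and fc1 no longer fits. With it, fc1 could stay at 8*4*4 = 128 inputs for both sizes.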

PS: Maybe you can also have a look at Conv1d, because mel-spectrograms are usually treated as sequential data of shape [B, n_mels, time], but just saying.


Thank you very much. This helps a lot.

This is the code of my training loop:

def train(train_loader, model, optimizer, criterion, n_epochs, batch_size):
    cost = []
    metric_hist = []
    for epoch in range(n_epochs):
        total_loss = total_metrics = 0
        for i, (x, y) in enumerate(train_loader):
            # reshape input to 4D   =>  [batch_size, N_channels=1, Height, Width]
            x = np.reshape(x, [batch_size, 1, x.shape[1], x.shape[2]])
            # reshape target to 2D  =>  [batch_size, N_channels=1]
            y = np.reshape(y, [batch_size, 1])

            # convert to float and load to GPU
            x = x.float().to(device)
            y = y.float().to(device)


            # forward
            y_pred = model(x)

            # calc loss & rmse
            loss = criterion(y_pred, y)                                     # LOSS = Mean Square Error
            metrics = rmse(y_pred, y)                                    # METRICS = Root Mean Square Error

            # backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # append results
            total_loss += loss.item()
            total_metrics += metrics.item()

            # print status every 100th step
            if (i+1) % 100 == 0:
                print(i+1, " out of ", len(train_loader), "iterations: ",
                      round(((i+1) / len(train_loader)) * 100, 2), "% Done -- LOSS: ", round(loss.item(), 3),
                      "-- RMSE: ", round(metrics.item(), 3))

        # calc average loss and RMSE of the epoch and append them to the histories
        avg_metrics = total_metrics / len(train_loader)
        avg_loss = total_loss / len(train_loader)
        cost.append(avg_loss)
        metric_hist.append(avg_metrics)

        print(epoch+1, " out of ", n_epochs, "epochs: ", round(((epoch+1) / n_epochs) * 100, 3)
              , "% Done --", "LOSS: ", round(avg_loss, 2), "-- RMSE: ", round(avg_metrics, 3), '\n')

    return cost, metric_hist

Hi @BeneFr,

I’m pretty confused. Your input values seem to be of shape [B, 1, H, W], as expected, and your target is of shape [B, 1], which matches the expected output shape [B, 1] of your model. However, your warning states that the target passed to F.mse_loss() has shape [2592, 1] and the input (your model output) has shape [1, 1]. (Also, 2592 / 162 = 16.)
Can you print the shapes of x and y before you pass them to the model, and the shape of your model output y_pred? Do you use MSELoss anywhere other than for your loss and metrics?

Hi Caruso,

the shape of x is [162, 1, 64, 64] and the shape of y is [162, 1] before I input them to the model. The shape of the model output y_pred is [162, 1]. I only use MSELoss for my loss in the training and testing loop.

The warning doesn’t appear when I use 1568 as input for the flatten layer. I just don’t understand why.


The warning doesn’t appear when I use 1568 as input for the flatten layer. I just don’t understand why.

What other size did you use for the input of your flatten layer? The output of your last conv block (conv2 plus pooling) has shape [B, 8, 14, 14], which flattens to [B, 1568], so any other value in fc1 should normally raise a size mismatch error.
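One caveat about x.view(-1, N): if N is wrong but still divides the total number of elements, view does not fail. It silently changes the batch dimension instead. The plain-Python sketch below only does the arithmetic. The helper name rows_after_view is made up, and 98 is a hypothetical value chosen for illustration, since the thread does not say which size was tried.

```python
def rows_after_view(batch, channels, height, width, n_features):
    """Rows produced by x.view(-1, n_features) on a [batch, channels, height, width] tensor."""
    total = batch * channels * height * width
    if total % n_features != 0:
        # In PyTorch, view would raise a RuntimeError in this case.
        raise ValueError("n_features must divide the total element count")
    return total // n_features


print(rows_after_view(162, 8, 14, 14, 1568))  # 162, batch dimension preserved
print(rows_after_view(162, 8, 14, 14, 98))    # 2592, batch dimension is now wrong
```

2592 happens to be the row count in the warning, but whether this is what actually happened here is not confirmed. Flattening with x.view(x.size(0), -1) or torch.flatten(x, 1) keeps the batch dimension fixed, so a wrong fc1 size then fails loudly in the Linear layer.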

Hi Caruso,

thank you very much for your help! I finally figured out how to calculate the conv layer output dimensions.

With the formula:

Width_out = floor((Width_in - Width_filter + 2 * padding) / stride) + 1

and equivalently for the height.

Then 2x2 pooling halves the result (rounding down). Afterwards, Width_out * Height_out * N_output_channels gives the number of inputs for the flatten layer.
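Written out for the network in this thread (64x64 input, 5x5 kernels, padding 1, stride 1, 2x2 pooling), the calculation looks like this. The symbols k, p and s are shorthand for filter width, padding and stride. The floor brackets reflect that the result is rounded down when the division is not exact, which matters in the second pooling step.

```latex
\begin{align*}
W_\text{out} &= \left\lfloor \frac{W_\text{in} - k + 2p}{s} \right\rfloor + 1 \\
\text{conv1:}\quad & \lfloor (64 - 5 + 2 \cdot 1)/1 \rfloor + 1 = 62 \\
\text{pool:}\quad & \lfloor 62 / 2 \rfloor = 31 \\
\text{conv2:}\quad & \lfloor (31 - 5 + 2 \cdot 1)/1 \rfloor + 1 = 29 \\
\text{pool:}\quad & \lfloor 29 / 2 \rfloor = 14 \\
\text{flatten:}\quad & 8 \cdot 14 \cdot 14 = 1568
\end{align*}
```

The height goes through the same steps, so the flattened size is 8 channels times 14 times 14.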

Am I correct so far?

Hi BeneFr,

Yes, your formula is right; you can find it in the docs of the Conv2d layer too :smiley: .
You can also print the shape after the last convolution, if you don’t want to calculate the output shape ^^
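A minimal sketch of that approach, assuming the same layer settings as Basic_CNN and a 64x64 single-channel input: push a dummy tensor through the conv and pooling layers and read off the shape. The variable names dummy, features and n_flat are just for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same conv/pool settings as Basic_CNN in the question.
conv1 = nn.Conv2d(1, 16, (5, 5), padding=1, stride=1)
conv2 = nn.Conv2d(16, 8, (5, 5), padding=1, stride=1)
pool = nn.MaxPool2d(2, 2)

with torch.no_grad():
    dummy = torch.zeros(1, 1, 64, 64)  # one fake spectrogram
    features = pool(F.relu(conv2(pool(F.relu(conv1(dummy))))))

print(features.shape)         # torch.Size([1, 8, 14, 14])
n_flat = features[0].numel()  # features per sample = in_features of fc1
print(n_flat)                 # 1568
```

The value of n_flat is what goes into nn.Linear(n_flat, 120). If the input size or the conv settings change, rerunning this gives the new number without redoing the arithmetic by hand.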