I am trying to implement a CNN for a regression task on audio data. I am using mel-spectrograms of size (64, 64) as features. The network consists of two convolutional layers with max pooling, followed by three fully connected layers.
I am having trouble choosing the input dimension of the first fully connected layer, which receives the flattened output of the convolutional layers. The target size doesn’t match the input size and the following warning appears:
    UserWarning: Using a target size (torch.Size([2592, 1])) that is different to the input size (torch.Size([1, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
      return F.mse_loss(input, target, reduction=self.reduction)
By chance I figured out that with an input size of 1568 the warning doesn’t appear and the network trains successfully.
I am using a batch size of 162 and one channel. The shapes of X and y for one iteration are the following:
X.shape = [162, 1, 64, 64]
y.shape = [162, 1]
Here is the code of the network:
    import torch.nn as nn
    import torch.nn.functional as F


    class Basic_CNN(nn.Module):
        def __init__(self):
            super(Basic_CNN, self).__init__()
            self.conv1 = nn.Conv2d(1, 16, (5, 5), padding=1, stride=1)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(16, 8, (5, 5), padding=1, stride=1)
            self.fc1 = nn.Linear(1568, 120)
            self.fc2 = nn.Linear(120, 64)
            self.fc3 = nn.Linear(64, 1)

        def forward(self, x):
            # Convolutional layers with ReLU and 2x2 pooling
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            # Flatten output
            x = x.view(-1, 1568)
            # Fully connected layers
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            # Output layer
            x = self.fc3(x)
            return x
Could somebody be so kind as to explain to me how to choose the input size of this specific layer? How does it depend on the output size of the previous layers?
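For reference, here is a minimal sketch of how the value 1568 can be checked. It assumes the usual output-size rule for `nn.Conv2d` and `nn.MaxPool2d` with dilation 1, namely floor((size + 2 * padding - kernel) / stride) + 1 per spatial dimension. The helper name `conv_out` and the variables `s`, `flat_features`, and `measured` are made up for this illustration and are not part of the network above. The second half cross-checks the arithmetic by pushing a dummy tensor through the same layer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_out(size, kernel, padding=0, stride=1):
    """Length of one spatial dimension after Conv2d or MaxPool2d (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# Trace one spatial dimension of a 64x64 input through the layers of the post.
s = conv_out(64, kernel=5, padding=1)   # after conv1
s = conv_out(s, kernel=2, stride=2)     # after first pooling
s = conv_out(s, kernel=5, padding=1)    # after conv2
s = conv_out(s, kernel=2, stride=2)     # after second pooling

# conv2 has 8 output channels, so the flattened feature count is 8 * s * s.
flat_features = 8 * s * s

# Cross-check: run a dummy batch of one sample through the same layer setup.
conv1 = nn.Conv2d(1, 16, (5, 5), padding=1, stride=1)
pool = nn.MaxPool2d(2, 2)
conv2 = nn.Conv2d(16, 8, (5, 5), padding=1, stride=1)
with torch.no_grad():
    x = torch.zeros(1, 1, 64, 64)
    x = pool(F.relu(conv1(x)))
    x = pool(F.relu(conv2(x)))
measured = x.flatten(1).shape[1]        # features per sample after flattening

print(s, flat_features, measured)
```

One thing worth noting about the flatten step: `x.view(-1, N)` lets the first dimension absorb any mismatch, so a value of N that merely divides the total number of elements changes the number of rows instead of raising an error, and the output then no longer lines up with `y`. Flattening with `x.view(x.size(0), -1)` keeps the batch dimension fixed, so a wrong `fc1` input size would surface as a shape error in the linear layer.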
Thank you very much in advance.
P.S.: Sorry for the first, deleted post.