Mismatched dimensions

I’m trying to make a CNN audio classification model:

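import torch.nn as nn
import torch.nn.functional as F

class AudioClassification(nn.Module):
    def __init__(self, input_count, output_count, channel_count=32, stride=16):
        super().__init__()  # must run before any submodule is assigned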
        self.conv1 = nn.Conv1d(input_count, channel_count, kernel_size=80, stride=stride)
        self.bn1 = nn.BatchNorm1d(channel_count)
        self.pool1 = nn.MaxPool1d(4)
        self.conv2 = nn.Conv1d(channel_count, channel_count, kernel_size=3)
        self.bn2 = nn.BatchNorm1d(channel_count)
        self.pool2 = nn.MaxPool1d(4)
        self.fc1 = nn.Linear(2 * channel_count, output_count)
    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(self.bn1(x))
        x = self.pool1(x)
        x = self.conv2(x)
        x = F.relu(self.bn2(x))
        x = self.pool2(x)
        x = self.fc1(x)
        return F.log_softmax(x, dim=2)

def train(model, epoch, log_interval):
    for batch_count, (data, target) in enumerate(train_loader):
        data = transform(data)
        output = model(data)
        loss = F.nll_loss(output.squeeze(), target)
        if batch_count % log_interval == 0:
            print(f"Train Epoch: {epoch} [{batch_count * len(data)}/{len(train_loader.dataset)} ({100. * batch_count / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}")

and I’m running into the following error:
RuntimeError: Expected 3-dimensional input for 3-dimensional weight [32, 22050, 80], but got 2-dimensional input of size [32, 22050] instead

I understand it’s from the kernel_size, but I’m not sure how to change my data to include it.

Hi Gar!

When you pass an input to a Conv1d, it is expected to have shape
[nBatch, channels, length], where channels matches the
in_channels with which you instantiated the Conv1d. (nBatch and
length are not required by a given Conv1d to have specific values.)
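
As a small illustration of this shape rule, here is a minimal sketch. The layer sizes (1 input channel, 32 output channels, kernel_size 80, stride 16) and the tensor sizes are assumptions picked for the example, not values confirmed for your model.

```python
import torch
import torch.nn as nn

# Hypothetical layer: 1 input channel (e.g. mono audio), 32 output channels.
conv = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=80, stride=16)

# Input shape is [nBatch, channels, length]; channels must equal in_channels.
a = torch.randn(4, 1, 22050)   # batch of 4, length 22050
b = torch.randn(1, 1, 8000)    # batch of 1, a different length

out_a = conv(a)
out_b = conv(b)
print(out_a.shape)  # torch.Size([4, 32, 1374])
print(out_b.shape)  # torch.Size([1, 32, 496])
```

The same layer accepts both inputs because only the channels dimension is tied to how the layer was constructed.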

It looks like you are passing a two-dimensional tensor into your model
that presumably lacks an nBatch dimension. nBatch can be 1, but
the nBatch dimension has to be there.
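
If a dimension is missing, unsqueeze() can add it. The sketch below uses a random stand-in for the [32, 22050] tensor from the error message and shows two possible readings. Which one applies depends on what the two existing dimensions mean in your data, so both are assumptions.

```python
import torch

# Stand-in for the two-dimensional tensor from the error message.
data = torch.randn(32, 22050)

# Reading 1 (assumption): the tensor is [channels, length], batch is missing.
as_single_sample = data.unsqueeze(0)   # -> [1, 32, 22050]

# Reading 2 (alternative assumption): the tensor is [nBatch, length],
# and the channels dimension is missing.
as_mono_batch = data.unsqueeze(1)      # -> [32, 1, 22050]

print(as_single_sample.shape)
print(as_mono_batch.shape)
```

Under the second reading, the first Conv1d would need in_channels = 1 rather than 32.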

It looks like the length of your input sample is 22050. That’s fine.
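
To see why the length is not a constraint, the output length of a Conv1d can be worked out from the formula given in the PyTorch Conv1d documentation. A minimal sketch, using the kernel_size = 80 and stride = 16 from your conv1 and assuming the default padding and dilation; the helper name conv1d_out_length is made up for illustration:

```python
def conv1d_out_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (formula from the Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

out_len = conv1d_out_length(22050, kernel_size=80, stride=16)
print(out_len)  # 1374
```

Any input length of at least kernel_size gives a valid, positive output length here.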

I believe that the main problem is with conv1, the first Conv1d in your
model, and that you are mixing up the length of your input with the
number of input channels (in_channels) expected by conv1.

It appears that you are instantiating conv1 as:

self.conv1 = nn.Conv1d (in_channels = 22050, out_channels = 32, kernel_size = 80)

If the input to your model really does have 32 channels, then you would
want something like:

self.conv1 = nn.Conv1d (in_channels = 32, out_channels = 32, kernel_size = 80)

(It’s perfectly reasonable to have in_channels and out_channels be the
same, but they don’t have to be. However, in_channels does have to be
the same as the number of channels of the input passed into your model.)
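
A quick way to check the in_channels requirement is to feed the layer one tensor whose channel count matches and one whose channel count does not. This is only a sketch with random data; the 16-channel tensor is a made-up counterexample.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv1d(in_channels=32, out_channels=32, kernel_size=80)

good = torch.randn(1, 32, 22050)   # 32 channels: matches in_channels
good_out = conv1(good)
print(good_out.shape)              # torch.Size([1, 32, 21971])

bad = torch.randn(1, 16, 22050)    # 16 channels: does not match in_channels
try:
    conv1(bad)
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print("RuntimeError:", err)
```

The exact wording of the error message varies between PyTorch versions, so the sketch prints it rather than relying on a particular text.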

What is the shape of the input to your model (data, in your training loop)?
How many channels does it have? What is the meaning and value of
channel_count – used internally by your model?


K. Frank