RuntimeError: Given groups=1, weight of size [512, 13, 5], expected input[128, 1, 13] to have 13 channels, but got 1 channels instead

I’m having an issue with input/output sizes.
This is my dataloader:

train_data = torch.hstack((train_feat, train_labels))  # concatenate along dim=1 -> (3082092, 14)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=128, shuffle=True)
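
For comparison, a minimal sketch of an equivalent loader built on TensorDataset, which keeps features and labels separate and avoids slicing each batch (assuming train_labels holds integer class indices stored as floats):

from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(train_feat, train_labels.squeeze(1).long())
loader = DataLoader(dataset, batch_size=128, shuffle=True)
# each batch is then a (features, labels) pair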

This is my dataset:

print(len(train_loader))
24079

print((train_feat.shape))
torch.Size([3082092, 13])

print((train_labels.shape))
torch.Size([3082092, 1])
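
(Sanity check: 3082092 / 128 rounds up to 24079, which matches len(train_loader) above.)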

This is my model:

class TDNN(nn.Module):

    def __init__(self, feat_dim=13, embedding_size=512, num_classes=51,
                 config_str='batchnorm-relu'):
        super(TDNN, self).__init__()

        self.network = nn.Sequential(OrderedDict([
            ('tdnn1', TDNNLayer(feat_dim, 512, 5, dilation=1, padding=0,
                                config_str=config_str)),
            ('tdnn2', TDNNLayer(512, 512, 3, dilation=2, padding=0,
                                config_str=config_str)),
            ('tdnn3', TDNNLayer(512, 512, 3, dilation=3, padding=0,
                                config_str=config_str)),
            ('tdnn4', DenseLayer(512, 512, config_str=config_str)),
            ('tdnn5', DenseLayer(512, 1500, config_str=config_str)),
            ('stats', StatsPool()),
            ('affine', nn.Linear(3000, embedding_size))
        ]))
        self.nonlinear = get_nonlinear(config_str, embedding_size)
        self.dense = DenseLayer(embedding_size, embedding_size, config_str=config_str)
        self.classifier = nn.Linear(embedding_size, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight.data)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.network(x)
        if self.training:
            x = self.dense(self.nonlinear(x))
            x = self.classifier(x)
        return x
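
For reference, the weight of size [512, 13, 5] in the error matches an nn.Conv1d(13, 512, kernel_size=5), which expects input shaped (batch, channels, length). A minimal shape check (a sketch, assuming TDNNLayer wraps such a Conv1d):

import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=13, out_channels=512, kernel_size=5)
x = torch.randn(128, 13, 50)  # (batch, channels=13, sequence length=50)
print(conv(x).shape)          # torch.Size([128, 512, 46])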

The input should be batches of 128. I’m probably making an error with the data here:

class IterMeter(object):
    """keeps track of total iterations"""
    def __init__(self):
        self.val = 0

    def step(self):
        self.val += 1

    def get(self):
        return self.val


def train(model, device, train_loader, criterion, optimizer, scheduler, epoch, iter_meter, experiment):
    model.train()
    data_len = len(train_loader.dataset)
    with experiment.train():
        for batch_idx, _data in enumerate(train_loader):
            features, labels = _data[:, :-1], _data[:, -1]  # split the stacked batch back into (128, 13) and (128,)
            features, labels = features.to(device), labels.to(device)
            features = features.unsqueeze(dim=1)
            
            optimizer.zero_grad()

            output = model(features)  # (batch, n_class)

            loss = criterion(output, labels)
            loss.backward()

            experiment.log_metric('loss', loss.item(), step=iter_meter.get())
            experiment.log_metric('learning_rate', scheduler.get_lr(), step=iter_meter.get())

            optimizer.step()
            scheduler.step()
            iter_meter.step()
            if batch_idx % 100 == 0 or batch_idx == data_len:
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(features), data_len,
                    100. * batch_idx / len(train_loader), loss.item()))

iter_meter = IterMeter()
for epoch in range(1, epochs + 1):
    train(model, device, train_loader, criterion, optimizer, scheduler, epoch, iter_meter, experiment)

If I remove the features.unsqueeze(dim=1), I get:

RuntimeError: Expected 3-dimensional input for 3-dimensional weight [512, 13, 5], but got 2-dimensional input of size [128, 13] instead

@ptrblck any idea what’s causing the error?

Depending on what makes sense, you should be able to permute() the dimensions or reshape() so that the number of input channels is what you want.
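
For example (a sketch, assuming the 13 values are per-frame channels over some longer time window):

import torch

x = torch.randn(128, 50, 13)  # (batch, time, features) -- hypothetical window of 50 frames
x = x.permute(0, 2, 1)        # (batch, 13, time) -- the layout nn.Conv1d expects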

@eqy Thank you so much for your reply.
I’m a noob so I could use some more explanation.
The data is audio frames. Each frame has 13 features. The goal is to input 128 of those frames each time and end up with a linear classifier that classifies each frame into one of the 51 classes. I hope that makes sense.

Looking more closely, what happens if you simply call features.unsqueeze(dim=2) to make the input data shape [128, 13, 1]?
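
That is, a quick check of the resulting shape:

import torch

features = torch.randn(128, 13)
print(features.unsqueeze(dim=2).shape)  # torch.Size([128, 13, 1])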

It gave a different type of error. The train() function is the same as above, with only the unsqueeze line changed:

            features = features.unsqueeze(dim=2)  # ---> reshape to (128, 13, 1)

RuntimeError: Calculated padded input size per channel: (1). Kernel size: (5). Kernel size can't be greater than actual input size

Right, you might need to consider what the meaning of your training data is. At the moment you are passing in a batch of 128 examples, where each example has 13 channels but a length of only 1. I assume the TDNNLayer is something like a 1D convolution, so it cannot compute an output when the kernel size (5) is greater than the input length (1). The fundamental issue is that the sequence length must be increased to use the layer.
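
As an illustration of that (a hypothetical fix, assuming the 3082092 frames are consecutive in time), consecutive frames could be grouped into fixed-length windows so the convolution has a time axis to slide over:

window = 100                                 # hypothetical window length in frames
n = (train_feat.size(0) // window) * window  # drop the trailing remainder
windows = train_feat[:n].reshape(-1, window, 13).permute(0, 2, 1)
print(windows.shape)                         # (30820, 13, 100) -- what Conv1d(13, ...) expects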

The data is audio frames. Each frame has 13 features. The goal is to input 128 of those frames each time and end up with a linear classifier to classify each frame with one of the 51 classes.

From this I understand that your sequence length is 13. It’s also clear that the TDNN layer has a Conv1d block to which you are currently passing in_channels as 13, which I believe is incorrect. Could you try changing feat_dim to 1 and keeping features.unsqueeze(dim=1)? It shouldn’t throw an error now, but I’m not sure how the output will react.
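
That is, the suggested change (a sketch):

model = TDNN(feat_dim=1)              # one input channel
features = features.unsqueeze(dim=1)  # (128, 1, 13): the 13 values become the sequence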


I don’t understand why the input channels need to be 1; my understanding is that it should be 13, since the data size is (N * 13), which means the input should be 128 * 1 * 13. Correct?
After changing feat_dim to 1, with features.unsqueeze(dim=1), I get this error:

RuntimeError: Calculated padded input size per channel: (5). Kernel size: (7). Kernel size can't be greater than actual input size
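
That error follows from the layer arithmetic: with stride 1 and no padding, each Conv1d produces L_out = L_in - dilation * (kernel_size - 1), and tdnn3’s effective kernel size is 1 + 3 * (3 - 1) = 7. A quick check (a sketch, assuming TDNNLayer uses the Conv1d parameters from the constructor):

def conv1d_out_len(L, k, d=1, p=0):
    # output length of a stride-1 Conv1d
    return L + 2 * p - d * (k - 1)

L = 13                                 # sequence length after unsqueeze(dim=1)
for k, d in [(5, 1), (3, 2), (3, 3)]:  # tdnn1, tdnn2, tdnn3
    L = conv1d_out_len(L, k, d)
    print(L)                           # 9, then 5, then -1 -> tdnn3 cannot fit its kernel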

Any thoughts on how to fix the issue?