RuntimeError: stack expects each tensor to be equal size, but got [1, 6502400] at entry 0 and [2, 2173694] at entry 1

I'm trying to train a model to tell what's background noise and what's not, and I keep getting this error. I've truncated the audio so it should all be the same size, but it isn't. Any help would be appreciated! I'm new to this, so I'm not sure why the post is popping up weird.

```python
import torch
import torchaudio
from torch import nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

# Load background noise and speech audio files
background_files = [r"C:\Users\iaddi\Downloads\crowd_talking-6762.mp3"]
speech_files = [r"C:\Users\iaddi\Downloads\Cam_1.mp3"]

# Audio preprocessing
def pad_truncate(audio, max_length=16000):
    print(f"Before: {len(audio)}")
    if len(audio) > max_length:
        audio = audio[:max_length]
    elif len(audio) < max_length:
        audio = F.pad(audio, (0, max_length - len(audio)))

    return audio

# Create dataset
class AudioDataset(Dataset):
    def __init__(self, background_files, speech_files):
        self.background_files = background_files
        self.speech_files = speech_files

    def __getitem__(self, index):
        if index < len(self.background_files):
            audio, sample_rate = torchaudio.load(self.background_files[index])
            audio = audio.squeeze(0)
            audio = pad_truncate(audio)
            print(len(audio))
            label = 0
        else:
            audio, sample_rate = torchaudio.load(self.speech_files[index - len(self.background_files)])
            label = 1
        return audio, label

    def __len__(self):
        return len(self.background_files) + len(self.speech_files)

# Create model
class AudioClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(32, 64, 3, stride=2, padding=1)

        self.classifier = nn.Linear(64, 1)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)

        x = self.conv2(x)
        x = F.relu(x)

        x = self.conv3(x)
        x = F.relu(x)

        x = torch.mean(x, dim=2)
        x = self.classifier(x)

        return x

# Train model
dataset = AudioDataset(background_files, speech_files)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

model = AudioClassifier()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters())

num_epochs = 10
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(dataloader):

        # Forward pass and loss
        outputs = model(inputs)
        loss = criterion(outputs, labels.unsqueeze(1).float())

        # Backward pass and update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print('Finished training')
```

Some comments:

  1. audio.squeeze(0) in __getitem__(): if you have a stereo audio file, audio will have a shape of [2, length]. Calling audio.squeeze(0) on it still returns a [2, length] tensor, so you probably need some form of stereo-to-mono conversion here (see the sketch below).
  2. len(audio) in pad_truncate(): len() returns the size of the first dimension of the tensor, so in your case this will be 1 or 2. Maybe what you wanted was audio.shape[-1]?

Note: point (2) can be ignored if you set channels_first=False in torchaudio.load()
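
For example, a minimal sketch of how both points could look together (the load_mono helper is just a name I made up here, and the 16000-sample target is taken from your code):

```python
import torch.nn.functional as F
import torchaudio


def pad_truncate(audio, max_length=16000):
    # audio is a 1-D tensor of samples, so shape[-1] is the number of samples
    length = audio.shape[-1]
    if length > max_length:
        audio = audio[:max_length]
    elif length < max_length:
        audio = F.pad(audio, (0, max_length - length))
    return audio


def load_mono(path, max_length=16000):
    # torchaudio.load returns a [channels, length] tensor;
    # averaging over dim=0 gives mono and works for 1- and 2-channel files alike
    audio, sample_rate = torchaudio.load(path)
    audio = audio.mean(dim=0)
    return pad_truncate(audio, max_length)
```

Then __getitem__ only needs to call load_mono() on the background or speech path in both branches, so every sample comes out with the same shape.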


I made this change to my __getitem__:

```python
def __getitem__(self, index):
    if index < len(self.background_files):
        audio, sample_rate = torchaudio.load(self.background_files[index])
        audio = torch.mean(audio, dim=0, keepdim=True)
        audio = audio.squeeze(0)
        audio = pad_truncate(audio)
        print(len(audio))
        label = 0
    else:
        audio, sample_rate = torchaudio.load(self.speech_files[index - len(self.background_files)])
        audio = torch.mean(audio, dim=0, keepdim=False)
        audio = pad_truncate(audio)
        label = 1
    return audio, label
```

but unfortunately it's now saying that it's expecting a 3D or 4D input but got 2D, or "[2, 16000]".

For your 2nd comment, I'm a little confused about which direction you think I should go?

[2, 16000] means you have a two-channel (stereo) audio array. Maybe you need to apply the mono conversion to the background_files as well.
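
A quick way to see where the shapes go wrong (just a debugging sketch, using the dataset object from your code) is to print every sample before the DataLoader tries to stack them into a batch:

```python
# every printed shape should be identical, e.g. torch.Size([16000]);
# otherwise the stacking step inside the DataLoader will fail
for i in range(len(dataset)):
    audio, label = dataset[i]
    print(i, audio.shape, label)
```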

Please post the code and the complete error trace when referring to errors. You can surround each of them with three backticks to format code and errors so they are more readable, with proper indentation.

E.g.:

``` <CODE> ```