CNN model check

In the following code, I’ve tried to build a CNN model with 5 layers, as follows:
Layer 1: Convolutional with: filter = 32, kernel = 3x3, padding = same, pooling = Max pool 3x3, dropout = 0.1
Layer 2: Convolutional with: filter = 32, kernel = 3x3, padding = valid, pooling = Max pool 3x3, dropout = 0.2
Layer 3: Fully connected with: Neurons = 512, dropout=0.2
Layer 4: Fully connected with: Neurons = 265, dropout=0.2
Layer 5: Fully connected with: Neurons = 100, dropout=0.2

Here is the code I wrote:

    def __init__(self):
        super(MaqamCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.dropout1 = nn.Dropout(p=0.1)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.dropout2 = nn.Dropout(p=0.2)
        self.fc1 = nn.Linear(30*192*192, 512)
        self.dropout3 = nn.Dropout(p=0.2)
        self.fc2 = nn.Linear(512, 265)
        self.dropout4 = nn.Dropout(p=0.2)
        self.fc3 = nn.Linear(265, 100)
        
    def forward(self, x):
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        x = self.dropout1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.pool2(x)
        x = self.dropout2(x)
        x = x.view(-1, 30*192*192)
        x = self.fc1(x)
        x = self.dropout3(x)
        x = self.fc2(x)
        x = self.dropout4(x)
        x = self.fc3(x)
        return x

The expected input to the model is an audio sample at 48 kHz with a length of 30 seconds (48,000 × 30 = 1,440,000 samples), with batch size = 2.
When I run the code, I get the following error:

In this line: x = self.conv2(x)
RuntimeError: Calculated padded input size per channel: (1440000 x 1). Kernel size: (3 x 3). Kernel size can't be greater than the actual input size

I’ve done my calculations for the layers, and I think the input size and properties should work with the layers as I’ve described them, but apparently the code I wrote isn’t doing what I had in mind.
Any idea what is wrong here: the implementation of the model, or the architecture values themselves?

If these are 1-channel audio inputs, Conv2d might not be appropriate. Have you tried Conv1d/MaxPool1d?
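For example, here is a minimal sketch of your first two blocks rewritten with Conv1d/MaxPool1d (assuming the waveform arrives as shape (batch, 1, num_samples); the layer sizes are just taken from your description, not a definitive implementation):

import torch
import torch.nn as nn

# Hypothetical 1D version of the two conv blocks for mono raw audio.
conv_blocks = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=3, padding=1),   # "same" padding for kernel 3
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=3),
    nn.Dropout(p=0.1),
    nn.Conv1d(32, 32, kernel_size=3),             # "valid" (no padding)
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=3),
    nn.Dropout(p=0.2),
)

x = torch.randn(2, 1, 48000 * 30)   # batch of 2 mono 30-second clips at 48 kHz
print(conv_blocks(x).shape)         # torch.Size([2, 32, 159999])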

Thx! So it finally did work, but now I have another problem. According to the article whose algorithm I’m trying to implement, the best results were obtained by training the model on 30-second wave files with batch size 64 and the fully connected layers I mentioned: the first with 512 neurons, the second with 265, and the third with 100. But with these parameters the model gives me the following memory error:

RuntimeError: CUDA out of memory. Tried to allocate 10.99 GiB (GPU 0; 8.00 GiB total capacity; 3.60 GiB already allocated; 3.37 GiB free; 3.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I don’t think it should need this much memory!
Can someone explain what’s going on here?
Note that every audio sample has a size of about 6 MB.

Have you tried reducing the batch size to see if it works for a smaller batch? Most server-grade GPUs have around 40 GB each.

I’ve tried, and it gives me the following error for batch_size = 2:

RuntimeError: CUDA out of memory. Tried to allocate 9.77 GiB (GPU 0; 8.00 GiB total capacity; 13.50 KiB already allocated; 6.97 GiB free; 2.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So the way I see it, I need more memory to run it with these parameters. I have two questions:
1) Is it possible to use RAM in addition to the GPU memory? And if so, how can I do that?
2) Would moving to Google Colab Pro help in this situation? Does the Pro version offer enough memory for the job?
Thx a lot!

That seems unusually high.

Can you run the following on the model input:

print(np.prod([x for x in model_input.size()]))

Also, can you post your updated model and train code?

Sure! That line never gets to run because of the error…
Here is the train.py code:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import dataset
import model
import torch.nn.functional as F
import torch.nn.utils.rnn as rnn_utils
from dataset import MaqamDataset
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np


def pad_to_max_length(self, max_length):
    for i in range(len(self)):
        padded_data = F.pad(self.data[i][0], (0, max_length - len(self.data[i][0])), 'constant', 0)
        padded_data = padded_data.unsqueeze(0) if len(padded_data.shape) == 1 else padded_data
        padded_data = padded_data.unsqueeze(1)
        padded_data = padded_data.repeat(1, 32, 1, 1)
        self.data[i] = (padded_data, self.data[i][1])

def MFCC_plot(mfcc):
    plt.figure(figsize=(10, 4))
    mfcc = mfcc.detach().numpy()
    mfcc = mfcc.mean(axis=2).T
    librosa.display.specshow(mfcc, x_axis='time')
    plt.colorbar()
    plt.title('MFCC')
    plt.tight_layout()
    plt.show()

#clean GPU torch cache
torch.cuda.empty_cache()
# Define hyperparameters
batch_size = 2 # should be 64 according to page 7
learning_rate = 0.0001 #page 7 0.0001
num_epochs = 1 #should be 35 according to page 7

# Load the dataset
train_dataset = dataset.MaqamDataset(mode='train')

# Find the maximum length of the input tensors
max_length = 0
for i in range(len(train_dataset)):
    inputs, labels, mfcc = train_dataset[i]
    if inputs.shape[0] > max_length:
        max_length = inputs.shape[0]

# Pad all input tensors to the maximum length
# MaqamDataset.pad_to_max_length(1440000)
train_loader = data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Define the model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.MaqamCNN().to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
print(torch.cuda.is_available())
# Train the model
print("Starting training!")
for epoch in range(num_epochs):
    # print("in epoch number ", epoch)
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        # print("in process number ", i)
        inputs, labels, mfcc = data
        # MFCC_plot(mfcc)
        labels = labels.to(device)
        # print("inputs.shape = ", inputs.shape)
        inputs = inputs.unsqueeze(1).unsqueeze(3).cuda()
        optimizer.zero_grad()
        outputs = model(inputs)
        # print("Outputs shape = ", outputs.shape)
        batch_size1 = outputs.size(0)
        padding_size = max_length - outputs.size(1)
        padding = torch.zeros(batch_size1, padding_size).to(device)
        padded_outputs = torch.cat((outputs, padding), dim=1)
        loss = criterion(padded_outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print('Epoch %d, loss: %.3f' % (epoch + 1, running_loss / len(train_loader)))

# Save the model
torch.save(model.state_dict(), 'maqam_cnn2.pth')

And here is the model.py code:

import torch.nn as nn
import torch
import numpy as np
class MaqamCNN(nn.Module):
    def __init__(self):
        super(MaqamCNN, self).__init__()
        
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool1d(kernel_size=3)
        self.dropout1 = nn.Dropout(p=0.1)
        
        self.conv2 = nn.Conv1d(in_channels=32, out_channels=32, kernel_size=3, padding=0)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool1d(kernel_size=3)
        self.dropout2 = nn.Dropout(p=0.2)
        
        self.fc1 = nn.Linear(5119968, 512)
        self.dropout3 = nn.Dropout(p=0.2)

        self.fc2 = nn.Linear(265, 64)
        self.dropout4 = nn.Dropout(p=0.2)

        self.fc3 = nn.Linear(64, 100)
        self.dropout5 = nn.Dropout(p=0.2)

    def forward(self, x):
        x = torch.squeeze(x, 3)
        # x = np.transpose(x, (0, 2, 1))
        # print("0 - x.shape = ", x.shape)
        x = self.conv1(x)
        x = self.relu1(x)
        # print("1 - x.shape = ", x.shape)
        x = self.pool1(x)
        # print("2 - x.shape = ", x.shape)
        x = self.dropout1(x)
        # print("3 - x.shape = ", x.shape)

        x = self.conv2(x)
        # print("4 - x.shape = ", x.shape)
        x = self.relu2(x)
        x = self.pool2(x)
        x = self.dropout2(x)

        # x = x.view(-1, 30*192*192)
        x = x.view(x.size(0), -1)
        # print("5 - x.shape = ", x.shape)        
        x = self.fc1(x)
        x = self.dropout3(x)

        x = self.fc2(x)
        x = self.dropout4(x)

        x = self.fc3(x)
        x = self.dropout5(x)
        return x

And here is the dataset.py code:

import os
import torchaudio
import torch
from torch.utils.data import Dataset
import torch.nn as nn
import torch.nn.functional as F
import librosa
import numpy as np
class MaqamDataset(Dataset):
    def __init__(self, mode='train', transform=None):
        self.mode = mode
        self.transform = transform
        self.data_dir = r"C:\Users\USER\Documents\GitHub\dataset_cutten30"

        self.maqams = ['Ajam', 'Bayat', 'Hijaz', 'Kurd', 'Nahawand', 'Rast', 'Saba', 'Seka']
        self.audio_list = self._load_audio_list()
        self.data = [self.__getitem__(i) for i in range(len(self))]
        self.pad_to_max_length(1440000)

    def _load_audio_list(self):
        audio_list = []
        for i, maqam in enumerate(self.maqams):
            label_dir = os.path.join(self.data_dir, maqam)
            audio_list += [(os.path.join(label_dir, audio_name), i) for audio_name in os.listdir(label_dir) if audio_name.endswith('.wav')]
        return audio_list

    def __len__(self):
        return len(self.audio_list)

    def __getitem__(self, idx):
        audio_path, label_idx = self.audio_list[idx]
        waveform, sample_rate = torchaudio.load(audio_path)
        waveform = waveform[0] # only keep the first channel
        if self.transform:
            waveform = self.transform(waveform)
        mfcc = self.compute_mfcc(waveform)
        return waveform, label_idx, mfcc
    
    def pad_to_max_length(self, max_length):
        for i in range(len(self)):
            padded_data = F.pad(self.data[i][0], (0, max_length - len(self.data[i][0])), 'constant', 0)
            self.data[i] = (padded_data, self.data[i][1])

    def compute_mfcc(self, waveform):
        # Compute the MFCC of the waveform
        n_fft = 2048
        hop_length = 512
        n_mels = 128
        sr = 48000
        waveform = waveform.numpy()  # Convert PyTorch tensor to NumPy array
        mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels, n_mfcc=20)
        mfcc = np.transpose(mfcc)
        mfcc = mfcc.astype(np.float32)  # Ensure data type is compatible with np.issubdtype()
        return mfcc
            

Your self.fc1 layer has ~2.62 billion parameters. That alone is nearly 10 GB at float32, not counting the copies made by your optimizer/autograd. Perhaps you can try distilling your data down further with more conv and maxpool layers before going to the linear layers.
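For reference, a quick back-of-the-envelope check of that single layer, using the numbers from the posted model.py:

# Rough parameter/memory estimate for nn.Linear(5119968, 512)
in_features, out_features = 5_119_968, 512
params = in_features * out_features + out_features   # weights + bias
print(params)                    # 2,621,424,128  (~2.62 billion)
print(params * 4 / 1024**3)      # ~9.77 GiB at float32 -- the same 9.77 GiB your error reports
# Adam also keeps two extra buffers per parameter, plus gradients,
# so the real footprint is several times larger than the weights alone.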

I have changed the model to the following one so that it can run with batch size = 4. I think it is much worse than the one I was supposed to implement…
It also takes a long time to run, about 1.5 minutes per epoch!
But I don’t know what else I can do.

import torch.nn as nn
import torch
import numpy as np
class MaqamCNN(nn.Module):
    def __init__(self):
        super(MaqamCNN, self).__init__()
        
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool1d(kernel_size=3)
        self.dropout1 = nn.Dropout(p=0.1)
        
        self.conv2 = nn.Conv1d(in_channels=32, out_channels=32, kernel_size=3, padding=0)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool1d(kernel_size=3)
        self.dropout2 = nn.Dropout(p=0.2)

        self.conv3 = nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.relu3 = nn.ReLU()
        self.pool3 = nn.MaxPool1d(kernel_size=2)
        self.dropout3 = nn.Dropout(p=0.2)

        self.conv4 = nn.Conv1d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
        self.relu4 = nn.ReLU()
        self.pool4 = nn.MaxPool1d(kernel_size=2)
        self.dropout4 = nn.Dropout(p=0.2)

        self.conv5 = nn.Conv1d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
        self.relu5 = nn.ReLU()
        self.pool5 = nn.MaxPool1d(kernel_size=2)
        self.dropout5 = nn.Dropout(p=0.2)

        self.conv6 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
        self.relu6 = nn.ReLU()
        self.pool6 = nn.MaxPool1d(kernel_size=2)
        self.dropout6 = nn.Dropout(p=0.2)

        self.conv7 = nn.Conv1d(in_channels=32, out_channels=16, kernel_size=3, padding=1)
        self.relu7 = nn.ReLU()
        self.pool7 = nn.MaxPool1d(kernel_size=2)
        self.dropout7 = nn.Dropout(p=0.2)
        
        self.conv8 = nn.Conv1d(in_channels=16, out_channels=8, kernel_size=3, padding=1)
        self.relu8 = nn.ReLU()
        self.pool8 = nn.MaxPool1d(kernel_size=2)
        self.dropout8 = nn.Dropout(p=0.2)

        self.conv9 = nn.Conv1d(in_channels=8, out_channels=4, kernel_size=3, padding=1)
        self.relu9 = nn.ReLU()
        self.pool9 = nn.MaxPool1d(kernel_size=2)
        self.dropout9 = nn.Dropout(p=0.2)

        self.fc1 = nn.Linear(4996, 265)
        self.dropout13 = nn.Dropout(p=0.2)

        self.fc2 = nn.Linear(265, 128)
        self.dropout14 = nn.Dropout(p=0.2)

        self.fc3 = nn.Linear(128, 64)
        self.dropout15 = nn.Dropout(p=0.2)

    def forward(self, x):
        x = torch.squeeze(x, 3)

        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        x = self.dropout1(x)

        x = self.conv2(x)
        x = self.relu2(x)
        x = self.pool2(x)
        x = self.dropout2(x)

        x = self.conv3(x)
        x = self.relu3(x)
        x = self.pool3(x)
        x = self.dropout3(x)

        x = self.conv4(x)
        x = self.relu4(x)
        x = self.pool4(x)
        x = self.dropout4(x)

        x = self.conv5(x)
        x = self.relu5(x)
        x = self.pool5(x)
        x = self.dropout5(x)

        x = self.conv6(x)
        x = self.relu6(x)
        x = self.pool6(x)
        x = self.dropout6(x)

        x = self.conv7(x)
        x = self.relu7(x)
        x = self.pool7(x)
        x = self.dropout7(x)

        x = self.conv8(x)
        x = self.relu8(x)
        x = self.pool8(x)
        x = self.dropout8(x)

        x = self.conv9(x)
        x = self.relu9(x)
        x = self.pool9(x)
        x = self.dropout9(x)

        x = x.view(x.size(0), -1)   

        x = self.fc1(x)
        x = self.dropout13(x)

        x = self.fc2(x)
        x = self.dropout14(x)

        x = self.fc3(x)
        x = self.dropout15(x)
        return x

And the results I get for 10 epochs are:

Epoch 1, loss: 9.732
Epoch 2, loss: 5.282
Epoch 3, loss: 5.706
Epoch 4, loss: 5.364
Epoch 5, loss: 5.514
Epoch 6, loss: 5.412
Epoch 7, loss: 4.898
Epoch 8, loss: 5.174
Epoch 9, loss: 4.935
Epoch 10, loss: 5.077

Can you tell me how I can improve the model?
Edit: with an 80% train / 20% test split, the accuracy is 10% on the test data, and only 16% when testing on the training data!!!

Your input size is 1,440,000. That is roughly equivalent to a 700×700 color image (700 × 700 × 3 ≈ 1.47 million values), as far as size is concerned. 1.5 minutes per epoch isn’t bad.

You could apply AvgPool1d and/or increase each MaxPool1d kernel_size to distill the sequence down more efficiently. With that many layers you may also run into the vanishing-gradients problem using ReLU(), so you might want to consider an activation layer that can self-regulate. See: GLU Variants Improve Transformer (arXiv:2002.05202).
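As a rough illustration (not the paper’s architecture), a downsampling block with a larger pooling kernel and a GLU-style gated activation could look like this; the conv outputs twice the channels so nn.GLU can split them into a value half and a gate half:

import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    # Hypothetical block: bigger pooling + self-gating activation.
    def __init__(self, in_ch, out_ch, pool=4, p_drop=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size=3, padding=1)
        self.glu = nn.GLU(dim=1)        # value * sigmoid(gate), halves the channels
        self.pool = nn.MaxPool1d(pool)  # kernel_size=4 shrinks the sequence 4x per block
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):
        return self.drop(self.pool(self.glu(self.conv(x))))

x = torch.randn(2, 32, 159999)
print(GatedConvBlock(32, 64)(x).shape)   # torch.Size([2, 64, 39999])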

Think of each of the channels coming out of your CNN as adjectives/nouns that describe the qualities of the input sound to the fully connected layers: “minor/major”, “tempo”, “dissonant/harmonic”, etc. If you’re running a classification problem, you ideally want the CNN to bring the sequence length down to 1 while keeping a lot of channels to describe the input. That’s because fully connected layers are awful at sequences (data where the order is as important as the values) but are great at logic.
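One common way to do that is an adaptive pooling layer that collapses the time axis to length 1 before the fully connected head, e.g. (the channel count here is made up; 8 outputs because your dataset has 8 maqams):

import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool1d(1),   # (batch, channels, length) -> (batch, channels, 1)
    nn.Flatten(),              # -> (batch, channels)
    nn.Linear(64, 8),          # 8 maqam classes
)

features = torch.randn(2, 64, 19999)   # whatever your conv stack produces
print(head(features).shape)            # torch.Size([2, 8])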

You can refer to the PyTorch documentation on memory management for more details and specific implementation instructions. You should also consider checking the Colab Pro documentation or contacting their support for specific information about the memory resources available.
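For example, the max_split_size_mb option mentioned in your error message is set through an environment variable before the first CUDA allocation (the value below is only an example); note that in your case the single allocation is genuinely too large, so this helps with fragmentation rather than with an oversized fc1:

import os
# Must be set before the first CUDA allocation (e.g. at the very top of train.py).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"   # example value

import torch  # import after setting the variable so the allocator picks it up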