MFCC extracted by librosa PyTorch

I am extracting MFCC features with librosa and saving them as PNG images using matplotlib for audio classification. However, saving these images takes a long time.
Moreover, when I use the pre-trained weights of EfficientNet_v2_m, I get great results during validation and on unseen data.
I have two questions:
Q1- How can I save the MFCC images faster?

This is how I am generating the MFCC images:

import librosa
import librosa.display
import matplotlib.pyplot as plt
y, sr = librosa.load('Down/test.wav')
MFCC = librosa.feature.mfcc(y=y, sr=sr)
img = librosa.display.specshow(MFCC)
plt.savefig("Down/out.png", bbox_inches='tight', pad_inches=0)

I then split the images into training and testing sets for k-fold cross-validation, plus an unseen set for checking the accuracy of the model at the end:

def train(model, optimizer, cost, epochs, dataloader):
    model.train()  # enable dropout and batch-norm updates
    for epoch in tqdm(range(epochs), colour="yellow"):
        print(f'******  Starting Epoch {epoch+1}:  ******')
        for i, data in enumerate(dataloader):
            inputs = data["image"]
            targets = data["target"]
            # device is assumed to be defined globally, e.g. torch.device('cuda')
            inputs = inputs.to(device, dtype=torch.float)
            targets = targets.to(device)
            optimizer.zero_grad()  # clear gradients from the previous step
            outputs = model(inputs)
            loss = cost(outputs, targets)
            loss.backward()  # backpropagate
            optimizer.step()  # update the weights
            if i % 100 == 0:
                current_loss, current = loss.item(), i * len(inputs)
                print(f'loss:{current_loss:>7f} [{current:>5d}/{len(dataloader.dataset):>5d}]')
    print('Training process has finished. Saving trained model.')

def val(fold, model, dataloader, results):
    print('----  |||| Starting Validation |||| ----')
    # Save the model trained on this fold (state_dict is assumed here)
    save_path = f'models/model_fold_No_{fold+1}.pth'
    torch.save(model.state_dict(), save_path)
    model.eval()  # disable dropout, use running batch-norm statistics
    correct, total = 0, 0
    with torch.no_grad():
        for i, data in tqdm(enumerate(dataloader), colour="green"):
            inputs = data["image"]
            targets = data["target"]
            inputs = inputs.to(device, dtype=torch.float)
            targets = targets.to(device)
            outputs = model(inputs)
            # The index of the largest logit is the predicted class
            _, predicted = torch.max(outputs, 1)
            total += targets.size(0)
            correct += (predicted == targets).sum().item()
        print('Accuracy for fold [%d]: %d %%' % (fold, 100.0 * correct / total))
        results[fold] = 100.0 * (correct / total)

Q2- Are my accuracy calculation and k-fold validation correct?

Please find training and validation information below:
MFCC was used instead of Spectrogram
10 Epochs
10 Folds


Fold 0: 99.75 %
Fold 1: 100.0 %
Fold 2: 99.92857142857143 %
Fold 3: 99.96428571428572 %
Fold 4: 100.0 %
Fold 5: 100.0 %
Fold 6: 100.0 %
Fold 7: 99.96428571428572 %
Fold 8: 99.96428571428572 %
Fold 9: 99.96428571428572 %
Average: 99.95357142857145 %
**(base)$ python3 src/ **
$$$$$$$$$$$$$$$ Model Evaulations: $$$$$$$$$$$$
---- |||| Starting Validation |||| ----
Please Wait …: 100%|█████████████████████████████████████████████████████████████████████████████| 500/500 [00:36<00:00, 13.64it/s]
Accuracy for Model resulted from Fold [1]: 99.83 %
---- |||| Starting Validation |||| ----
Please Wait …: 100%|█████████████████████████████████████████████████████████████████████████████| 500/500 [00:35<00:00, 13.94it/s]
Accuracy for Model resulted from Fold [2]: 99.97 %


I would appreciate any thoughts or ideas.

Thanks in advance

@ptrblck Could you kindly provide your feedback?

You could check different image libraries, such as PIL, OpenCV, etc. and compare them to your current approach using matplotlib. However, the speed also depends on the write speed of your SSD which might already be the limiting factor so changing the software stack might not give you any speedup.
I also assume this task has to be done once, so I’m unsure if it’s worth optimizing it assuming it doesn’t take days.
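
If it does turn out to be worth optimizing, here is one possible sketch, under a few assumptions: skip the matplotlib figure entirely, write the MFCC array straight to disk with `plt.imsave`, and spread the files over several worker processes. The helper names (`png_path`, `wav_to_png`, `convert_all`) and the worker count are made up for illustration. Note that the resulting images are `n_mfcc` pixels high (one pixel per value) rather than the figure-sized PNGs produced by `savefig`, so the model input would change.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def png_path(wav_path, out_dir):
    """Map 'some/dir/name.wav' to '<out_dir>/name.png'."""
    return Path(out_dir) / (Path(wav_path).stem + ".png")


def wav_to_png(wav_path, out_dir):
    """Compute the MFCC of one file and write it as a PNG."""
    # Heavy imports live inside the worker so each process loads them itself.
    import librosa
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, no window is created
    import matplotlib.pyplot as plt

    y, sr = librosa.load(wav_path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr)
    out_path = png_path(wav_path, out_dir)
    # imsave writes the array directly, without building a figure first
    plt.imsave(out_path, mfcc, origin="lower", cmap="viridis")
    return out_path


def convert_all(wav_paths, out_dir, workers=4):
    """Convert a list of audio files in parallel."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(wav_to_png, wav_paths, [out_dir] * len(wav_paths)))
```

On platforms that spawn worker processes, `convert_all` should be called from under `if __name__ == "__main__":`. If you keep the `specshow` + `savefig` approach in a loop, calling `plt.close('all')` after each file may also help, since artists otherwise pile up on the same axes and every save gets slower.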

I don’t see any k-fold CV logic in your code besides passing a fold argument to val and printing it.


Thank you very much for your reply.

Please find the k-fold code below and confirm whether I am doing it right:

    k_folds = 10
    print(f'device is {device}')
    results = {}
    epochs = 30
    df = pd.read_csv('input/data/train_MFCC.csv')
    df = df.sample(frac = 1).reset_index(drop = True)
    kf = StratifiedKFold(n_splits=k_folds, shuffle = False)
    for f, (t_, v_) in enumerate(kf.split(X=df, y=df['class_num'])):
        df.loc[v_, 'kfold'] = f
    #print (df.kfold.value_counts())
    for fold_ in range (k_folds):
        model = models.efficientnet_v2_m(weights = 'DEFAULT')
        model.classifier[1] = nn.Linear(1280, 7)
        cost = torch.nn.CrossEntropyLoss(label_smoothing = 0.11)
        learning_rate = 0.0001
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        train_df = df[df.kfold != fold_].reset_index(drop = True)
        test_df = df[df.kfold == fold_].reset_index(drop=True)
        train_set = MyDataset(train_df['path'].values, train_df['class_num'].values)
        train_loader = DataLoader(dataset= train_set, batch_size=16, num_workers = 4)
        test_set = MyDataset(test_df.path.values, test_df.class_num.values)
        test_loader = DataLoader(dataset= test_set, batch_size=16, num_workers = 4)
        print (f'FOLD {fold_}:')
        train(model, optimizer, cost, epochs, train_loader)
        val(fold_,model, test_loader, results)

Q1: I am still trying to confirm whether my logic is correct.

Q2: My loss goes from 2.0 down to 0.5 with batch_size = 16, and the resulting accuracy for fold 0 seems to reach 99%. I would appreciate any thoughts.

Thank you very much

Your code is not executable so I cannot verify it’s working correctly, but at least I cannot find anything obviously wrong. Your code snippet also seems to stick to the StratifiedKFold example besides a few changes using pandas etc.

The high accuracy could mean that the dataset is not too hard to train on and that your model(s) are able to quickly learn the important feature(s).
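
Accuracies this close to 100% can also come from the same sample ending up in more than one fold, so a quick sanity check on the fold assignment may be worthwhile. The sketch below is only an illustration: `fold_report` is a made-up helper, and it assumes the `path`, `class_num` and `kfold` columns from the snippet above. It lists paths that occur in more than one fold and counts the classes per fold.

```python
from collections import Counter, defaultdict


def fold_report(paths, labels, folds):
    """Return (paths seen in more than one fold, class counts per fold)."""
    folds_per_path = defaultdict(set)
    class_counts = defaultdict(Counter)
    for path, label, fold in zip(paths, labels, folds):
        folds_per_path[path].add(fold)
        class_counts[fold][label] += 1
    # A path that appears in several folds is validated on while also trained on
    leaked = sorted(p for p, f in folds_per_path.items() if len(f) > 1)
    return leaked, dict(class_counts)
```

With the DataFrame from the k-fold code this could be called as `fold_report(df['path'], df['class_num'], df['kfold'])`. An empty first result and similar class counts in every fold would be consistent with a clean stratified split. It would not detect clips that were cut from the same original recording but stored under different paths.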


Thank you very much for the quick response. I am extracting the MFCC features and passing them to the CNN. One last question: when I wanted to double-check by extracting the MFCCs using torchaudio, I did not get the same output image.

The librosa MFCC image is as follows:

I am trying to set the same default parameters in PyTorch, but with no luck.

Librosa Code:

y, sr = librosa.load(file_path)
S = librosa.feature.mfcc(y=y, sr=sr)
img = librosa.display.specshow(S, sr=sr)
plt.savefig('image.png', bbox_inches='tight', pad_inches=0)

I have tried using torchaudio, but it does not give the same result:

waveform, sample_rate = torchaudio.load(file_path)
spectrogram_tensor = torchaudio.transforms.MFCC(
    sample_rate=22050,
    n_mfcc=256,
    melkwargs={
        "n_mels": 256,
        "n_fft": 2048,
        "win_length": None,
        "mel_scale": "htk",
    },
)(waveform)
plt.imsave('test.png', spectrogram_tensor[0, :, :].numpy(), vmin=-80, vmax=0, origin="lower", cmap='viridis')

Thank you

What are the differences between these outputs?

For the same WAV file, the librosa result is:

The torchaudio result is:

It seems as if the number of frequency buckets is lower in the torchaudio results, so you might need to check the used parameters again.


Alright, I will double-check. Thank you very much for your support.

Also, check the image dimensions first as I cannot check them in the posted images and see if the two images might contain very similar data but are just scaled to a different figure layout or so.

Well, what I have noticed is that the n_mfcc parameter (n_mfcc = 256 for PyTorch) determines the height of the image, and I am not sure how the width is calculated in torchaudio.

However, in librosa the size is fixed at 496 x 369 for all images.
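
The fixed size on the librosa side most likely comes from matplotlib rather than from librosa: `savefig` renders the figure, not the array, so every PNG has the figure's pixel size no matter how many frames the MFCC matrix has. A rough back-of-the-envelope check, assuming matplotlib's default figure size (6.4 x 4.8 inches), default dpi (100) and default subplot margins, with `bbox_inches='tight'` cropping to roughly the axes area:

```python
# Matplotlib defaults (assumed to be unchanged in rcParams)
FIGSIZE_IN = (6.4, 4.8)
DPI = 100
LEFT, RIGHT, BOTTOM, TOP = 0.125, 0.9, 0.11, 0.88


def axes_size_px(figsize=FIGSIZE_IN, dpi=DPI):
    """Approximate pixel size of the axes area inside the figure."""
    fig_w, fig_h = figsize[0] * dpi, figsize[1] * dpi
    return fig_w * (RIGHT - LEFT), fig_h * (TOP - BOTTOM)


width, height = axes_size_px()
print(f"{width:.1f} x {height:.1f}")
```

This gives about 496 x 370 pixels, which is close to the 496 x 369 reported above. `plt.imsave`, in contrast, writes one pixel per array element, so its output is `n_mfcc` pixels high and one pixel wide per time frame.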

I am not really sure which exact parameters I need to consider, because, for example, hop_length is not mentioned in my librosa call, and in PyTorch it is set to n_fft, if I am not wrong.
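
For reference, a hedged sketch of how the two calls could be lined up. `librosa.feature.mfcc` called with only `y` and `sr` falls back to librosa's defaults (as far as I know: `n_mfcc=20`, `n_fft=2048`, `hop_length=512`, `n_mels=128`, Slaney-style mel filters), and `librosa.load` resamples to 22050 Hz mono by default, whereas `torchaudio.load` keeps the file's native rate and channels. In torchaudio, an unset `hop_length` defaults to `win_length // 2` (half of `n_fft` when `win_length` is `None`), not to `n_fft`. With centered frames, the width is `1 + num_samples // hop_length` in both libraries. The names `LIBROSA_LIKE`, `num_frames` and `torchaudio_mfcc_like_librosa` are made up for this example, and exact numerical equality between the two libraries is not guaranteed.

```python
# Settings that mirror librosa's defaults (assumption: no overrides on the librosa side)
LIBROSA_LIKE = {
    "sample_rate": 22050,
    "n_mfcc": 20,
    "melkwargs": {
        "n_fft": 2048,
        "hop_length": 512,
        "n_mels": 128,
        "mel_scale": "slaney",
        "norm": "slaney",
    },
}


def num_frames(num_samples, hop_length):
    """Number of time frames (image width) for a centered STFT."""
    return 1 + num_samples // hop_length


def torchaudio_mfcc_like_librosa(file_path):
    """Load a file the way librosa.load does, then compute MFCCs with torchaudio."""
    # Imported lazily so the helpers above work without torchaudio installed
    import torchaudio
    import torchaudio.functional as F

    waveform, sr = torchaudio.load(file_path)
    waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono
    target_sr = LIBROSA_LIKE["sample_rate"]
    if sr != target_sr:
        waveform = F.resample(waveform, orig_freq=sr, new_freq=target_sr)
    transform = torchaudio.transforms.MFCC(**LIBROSA_LIKE)
    return transform(waveform)[0]  # shape: (n_mfcc, num_frames)
```

As a worked example, one second of audio at 22050 Hz gives `num_frames(22050, 512) == 44` columns with `hop_length=512`, but only 22 columns with torchaudio's default hop of 1024 for `n_fft=2048`, which would explain images of different widths for the same file.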