How to create a custom dataset for audio recognition


I have a question: I have a dataset of audio files that I'd like to convert into mel spectrograms, and I want to use the torchaudio library to convert the audio into tensors directly. I've seen some people do this by saving the spectrograms as images, and I'd like to bypass that step and train directly on tensors. My question is: how should I go about creating a DataLoader so that I can perform this computationally expensive operation while taking advantage of the fact that each folder is the label containing all the training data for that class? The end goal is to create a classification algorithm.

Thanks so much!

You could create a custom Dataset as explained in this tutorial and apply the transformation to each sample in the __getitem__ method of the class. To create the class indices based on the folder structure, you could reuse the logic from DatasetFolder.
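A minimal sketch of such a Dataset (the folder-scanning logic mirrors DatasetFolder; the injectable `loader` and `transform` arguments are illustrative choices, not a fixed API):

```python
import os
from torch.utils.data import Dataset

class AudioFolderDataset(Dataset):
    """Folder-per-class audio dataset.

    `loader` (e.g. torchaudio.load) and `transform`
    (e.g. torchaudio.transforms.MelSpectrogram()) are injected so the
    expensive spectrogram computation happens lazily in __getitem__.
    """

    def __init__(self, root, loader, transform=None):
        self.loader = loader
        self.transform = transform
        # build class indices from the sub-folder names, as DatasetFolder does
        classes = sorted(d for d in os.listdir(root)
                         if os.path.isdir(os.path.join(root, d)))
        self.class_to_idx = {c: i for i, c in enumerate(classes)}
        self.samples = [(os.path.join(root, c, f), self.class_to_idx[c])
                        for c in classes
                        for f in sorted(os.listdir(os.path.join(root, c)))]

    def __getitem__(self, index):
        path, label = self.samples[index]
        waveform, sample_rate = self.loader(path)
        if self.transform is not None:
            waveform = self.transform(waveform)  # e.g. waveform -> mel spectrogram
        return waveform, label

    def __len__(self):
        return len(self.samples)
```

With torchaudio you would pass e.g. `torchaudio.load` as the loader and a `torchaudio.transforms.MelSpectrogram()` instance as the transform.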

Similarly to the previous answer, you can also check out the audio classification tutorial and update the line tensors += [waveform] in collate_fn to tensors += [transform(waveform)], where transform is whatever transform you want.
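Concretely, the change could look like this (a sketch assuming batches of (waveform, label) pairs as in the tutorial; `make_collate_fn` is a hypothetical helper, not tutorial code):

```python
import torch

def make_collate_fn(transform):
    """Wrap a transform so the DataLoader applies it while batching."""
    def collate_fn(batch):
        tensors, targets = [], []
        for waveform, label in batch:
            tensors += [transform(waveform)]  # was: tensors += [waveform]
            targets += [label]
        # stacking assumes the transform yields equally sized tensors
        return torch.stack(tensors), torch.tensor(targets)
    return collate_fn
```

You would then pass it as DataLoader(..., collate_fn=make_collate_fn(my_transform)).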

If your goal is to apply the transform, save the transformed waveform to disk to avoid recomputing it later, and then create a new dataset from the result, then you could also try to dynamically cache your dataset using something like diskcache_iterator.

You can use something simple like this:

from torch.utils.data import Dataset as TorchDataset

class SpectrogramDataset(TorchDataset):

    def __init__(self, file_label_ds, process_func, audio_path=""):
        self.ds = file_label_ds
        self.process_func = process_func
        self.audio_path = audio_path

    def __getitem__(self, index):
        # look up the file name and label, then compute the spectrogram
        file, label = self.ds[index]
        x = self.process_func(self.audio_path + file)
        return x, file, label

    def __len__(self):
        return len(self.ds)

file_label_ds is a dataset that gives you the file name and label.
process_func is a function that takes the full audio path and returns the spectrogram.
PS: this lets you extract the spectrograms in parallel on the CPU (if you have num_workers>0); it won't work if you use the GPU to extract the spectrograms.
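To illustrate the parallel-CPU point, here is a self-contained toy (the dataset and the stft-based "spectrogram" are stand-ins, not your actual pipeline): with num_workers=2, the DataLoader worker processes run __getitem__, and hence the expensive transform, in parallel on the CPU.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToySpectrogramDataset(Dataset):
    """Stand-in dataset: random 'audio' turned into a magnitude spectrogram."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        waveform = torch.randn(1, 400)  # stand-in for loading an audio file
        # stand-in for the expensive spectrogram computation
        spec = torch.stft(waveform, n_fft=64,
                          window=torch.hann_window(64),
                          return_complex=True).abs()
        return spec, index % 2

    def __len__(self):
        return self.n

# num_workers=2: __getitem__ (and the transform inside it) runs in worker processes
loader = DataLoader(ToySpectrogramDataset(8), batch_size=4, num_workers=2)
```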

Great stuff, truly appreciate the pointers. I did a rough implementation based on your suggestions! I have a small issue, and want to confirm that I'm thinking about this in the right way. I saw some examples where the pre-processing step converts audio into images, but I'd like to convert straight into a tensor, and also keep the dataset at a fixed size. I have audio files with different durations. How can I accomplish this in the context of spectrograms?

If I understood correctly, there are two different ways I can create this DataLoader: one is by creating a class and overriding the __getitem__ method, and the other is passing a collate_fn method. Is this a fair assessment?
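Regarding the fixed-size question: one common approach (a sketch, not from this thread; `fix_length` and `target_frames` are hypothetical names) is to pad or trim each spectrogram along the time axis, either at the end of process_func or on each item in the collate_fn, so every sample has the same shape and can be stacked into a batch.

```python
import torch
import torch.nn.functional as F

def fix_length(spec, target_frames=128):
    """Pad or trim a (channels, n_mels, time) spectrogram to target_frames."""
    time = spec.shape[-1]
    if time >= target_frames:
        return spec[..., :target_frames]           # trim long clips
    return F.pad(spec, (0, target_frames - time))  # zero-pad short clips on the right
```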