Framewise Audio Data Loader for a Large Audio Corpus

I am trying to create a data loader for an audio dataset. I have a bunch of audio files, and they are listed in a CSV file. To create a data loader, I need to inherit from the Dataset class and implement the __getitem__ and __len__ methods. I want to load and process the audio data on the fly. In addition, my DNN model is not sequence-wise: I need to load a set of audio files, pre-process them, and divide them into frames of constant size. At input time, the DNN should receive a minibatch of audio frames, not a whole audio sequence.

The __getitem__ method takes an index and returns the data frame corresponding to that index. All my audio file paths are in a CSV file, and I want the Dataset to read it and load, pre-process, and divide the audio into frames on the fly.

What do I do to make the ‘index’ argument of __getitem__ correspond to audio data frames?

Please help.


You can check the yesno dataset implementation for a general idea of how to build a custom dataset.

Regarding the paths in the CSV, you could create a preprocessed file that contains the audio files as tensors, similar to how they have done it here.

Well, a simple trick is to do your own calculations and set your own length in the __init__ of your dataset class, return that length in the __len__ method, and not worry about the index argument. The dataloader will then call your __getitem__ once for each index from 0 to length-1 (so length times per epoch), and you can randomly pick the frames you want from your data, irrespective of the index value.
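Here is a rough sketch of that trick, just to make it concrete. It is plain Python so it runs without any framework; for PyTorch you would subclass torch.utils.data.Dataset instead. The class name, the `load_audio` callable (path in, sequence of samples out) and the `frames_per_epoch` parameter are made up for illustration and are not part of any library.

```python
import random


class RandomFrameDataset:
    """Map-style dataset that ignores the index and returns a random frame.

    Hypothetical sketch: for PyTorch, inherit from torch.utils.data.Dataset.
    """

    def __init__(self, paths, load_audio, frame_len, frames_per_epoch,
                 transform=None, seed=None):
        self.paths = list(paths)            # e.g. the path column of the CSV
        self.load_audio = load_audio        # assumed: path -> sequence of samples
        self.frame_len = frame_len          # constant frame size, in samples
        self.frames_per_epoch = frames_per_epoch  # the length we choose ourselves
        self.transform = transform          # optional callable applied per frame
        self.rng = random.Random(seed)

    def __len__(self):
        # Report our own length rather than the number of files.
        return self.frames_per_epoch

    def __getitem__(self, index):
        # The index only bounds the epoch; it does not select the frame.
        if not 0 <= index < self.frames_per_epoch:
            raise IndexError(index)
        samples = self.load_audio(self.rng.choice(self.paths))
        if len(samples) < self.frame_len:
            raise ValueError("audio file is shorter than one frame")
        # Pick a random start so the frame fits inside the file.
        start = self.rng.randrange(len(samples) - self.frame_len + 1)
        frame = samples[start:start + self.frame_len]
        return self.transform(frame) if self.transform else frame
```

One thing to watch: if the loader runs several worker processes, each worker gets its own copy of the random generator, so you may want to seed the workers differently to avoid drawing the same frames in each.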

As for trimming the audio, normalizing, STFT calculation, etc., you can use any of the available transforms or write your own transforms, which you can pass to your custom dataset.
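If you go the write-your-own route, a transform is just a callable, and a small chain of them might look like the sketch below. The `Compose`, `TrimOrPad` and `PeakNormalize` names here are hypothetical helpers written in plain Python on lists of samples; real pipelines would more likely work on tensors and use library transforms.

```python
class Compose:
    """Apply a list of callables in order (hypothetical helper)."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, samples):
        for transform in self.transforms:
            samples = transform(samples)
        return samples


class TrimOrPad:
    """Cut the signal to `length` samples, or zero-pad it if it is shorter."""

    def __init__(self, length):
        self.length = length

    def __call__(self, samples):
        trimmed = list(samples[:self.length])
        return trimmed + [0.0] * (self.length - len(trimmed))


class PeakNormalize:
    """Scale the signal so that its largest absolute value is 1."""

    def __call__(self, samples):
        peak = max((abs(s) for s in samples), default=0.0)
        if peak == 0:
            return [0.0 for _ in samples]  # silent input stays silent
        return [s / peak for s in samples]


# Example chain: fix the length first, then normalize.
pipeline = Compose([TrimOrPad(4), PeakNormalize()])
```

The resulting `pipeline` object can then be handed to your dataset and called on each frame or file inside __getitem__.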