Hi all, I’m looking for a little help sorting out the logistics of an idea I had.
I was initially training a CNN to recognize the number of speakers in an audio file by converting the audio to spectrograms. Since the spectrograms vary in size, I handled this by simply resizing the images.
However, I found this approach to be naive and anticipate problems with it when the audio files vary greatly in length.
My thought now is to take a single audio file and chunk it into smaller, equal-sized pieces, feed that sequence of chunks to my CNN for feature embedding, and then pass the resulting feature sequence to an RNN.
My idea for a solution was to create a Dataset class that builds the sequence of smaller audio clips from each file and converts each chunk to a spectrogram image. Then I want to somehow trick the CNN into thinking it’s getting a batch of spectrograms.
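For concreteness, here's a rough sketch of the kind of Dataset I mean. All the names (`ChunkedSpectrogramDataset`, `waveforms`, the STFT parameters) are illustrative placeholders, not settled choices:

```python
import torch
from torch.utils.data import Dataset

class ChunkedSpectrogramDataset(Dataset):
    # Hypothetical sketch: `waveforms` is a list of 1-D audio tensors
    # (one per file) and `labels` the corresponding speaker counts.
    def __init__(self, waveforms, labels, chunk_len=16000, n_fft=256, hop=128):
        self.waveforms = waveforms
        self.labels = labels
        self.chunk_len = chunk_len
        self.n_fft = n_fft
        self.hop = hop

    def __len__(self):
        return len(self.waveforms)

    def _spectrogram(self, chunk):
        # Magnitude STFT -> (freq_bins, time_frames)
        spec = torch.stft(chunk, n_fft=self.n_fft, hop_length=self.hop,
                          window=torch.hann_window(self.n_fft),
                          return_complex=True).abs()
        return spec.unsqueeze(0)  # add a channel dim: (1, F, T)

    def __getitem__(self, idx):
        wav = self.waveforms[idx]
        # Zero-pad so the waveform splits into equal-sized chunks
        pad = (-len(wav)) % self.chunk_len
        wav = torch.nn.functional.pad(wav, (0, pad))
        chunks = wav.split(self.chunk_len)
        specs = torch.stack([self._spectrogram(c) for c in chunks])
        # specs: (seq_len, 1, F, T) -- a sequence of spectrogram "images"
        return specs, self.labels[idx]
```

So each item is already a *sequence* of spectrograms, with `seq_len` depending on the file's length.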
This seems fairly easy to accomplish… unless I want to use a DataLoader (I do). Then the dimensions of the data coming out of the DataLoader will be wrong for input into the CNN: the DataLoader stacks my per-file sequences and adds a batch dimension, so the CNN receives a 5-D tensor of shape (batch, seq_len, channels, height, width) instead of the 4-D (batch, channels, height, width) it expects. And since the sequence lengths differ between files, the default collation can't stack them in the first place.
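To illustrate the shape mismatch, and the flatten/unflatten trick I've been considering (the toy CNN and dimensions below are placeholders, not my actual model):

```python
import torch

# Suppose the DataLoader yields a batch of chunk sequences:
# (batch, seq_len, channels, freq, time) -- 5-D, but Conv2d wants 4-D.
batch = torch.randn(4, 10, 1, 129, 126)
b, s, c, f, t = batch.shape

# Flatten batch and sequence dims so the CNN sees one big batch of images...
flat = batch.view(b * s, c, f, t)          # (40, 1, 129, 126)
cnn = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3), torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten()
)
features = cnn(flat)                       # (40, 8)

# ...then restore the sequence dim before the RNN.
seq_features = features.view(b, s, -1)     # (4, 10, 8)
rnn = torch.nn.GRU(8, 16, batch_first=True)
out, _ = rnn(seq_features)                 # (4, 10, 16)
```

This is the "trick the CNN" part; the part I'm stuck on is getting the DataLoader to produce that 5-D batch in the first place.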
Does anybody have any clever ideas that would allow me to use a DataLoader so I can keep the multi-processing capabilities?
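The only direction I've sketched so far is a custom `collate_fn` that pads the sequences to a common length, roughly like this (again, names are placeholders, and I'm not sure this is the cleanest approach):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical collate_fn: each dataset item is (specs, label), where
# specs has shape (seq_len, C, F, T) and seq_len varies per file.
def collate_chunks(items):
    specs, labels = zip(*items)
    lengths = torch.tensor([s.shape[0] for s in specs])
    # Pad along the sequence dim so the sequences stack into one tensor:
    padded = pad_sequence(specs, batch_first=True)  # (batch, max_seq, C, F, T)
    return padded, lengths, torch.tensor(labels)
```

Passing this as `DataLoader(ds, collate_fn=collate_chunks, num_workers=...)` would keep the multi-processing, but I'd then be feeding padded chunks through the CNN. Is there a better way?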