My task is to take an episode of a TV show and its subtitles, then make the subtitle timings more accurate (from roughly 200 ms down to 20 ms). To do that, I want to learn to distinguish which parts of the audio are speech and which are not.
I’ve now taken the audio, converted it into a spectrogram, and separated each column of the spectrogram into a single data item. So now I have two arrays:
All I want to do is train a simple NN of linear layers to tell the two apart. train_speech contains FFT columns where people are talking and train_silence contains columns with no talking (I used the subtitles to make the distinction).
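For context, here is a rough sketch of how I built the spectrogram columns. The waveform, FFT size, and hop length below are placeholders, not my exact values:

```python
import torch

# Stand-in for the decoded episode audio: 5 s of fake 16 kHz mono audio.
waveform = torch.randn(16000 * 5)

n_fft, hop = 512, 160  # ~10 ms hop at 16 kHz (assumed parameters)
spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft), return_complex=True)
magnitude = spec.abs()   # (freq_bins, time_frames)
columns = magnitude.T    # one row per spectrogram column

print(columns.shape)     # (time_frames, n_fft // 2 + 1)
```

Each row of `columns` is then sorted into the speech or silence array based on the subtitle timings.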
My question is what DataLoader can I use to take these into torch?
There is a single `DataLoader` class (`torch.utils.data.DataLoader`), which accepts a `Dataset` and provides functionality such as shuffling and batching, optionally using multiple workers.
To create a custom Dataset you could have a look at this tutorial.
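A minimal sketch of that approach for your case, assuming `train_speech` and `train_silence` are 2-D tensors of spectrogram columns (the shapes below are placeholders):

```python
import torch
from torch.utils.data import Dataset, DataLoader

train_speech = torch.randn(100, 257)   # placeholder speech columns
train_silence = torch.randn(120, 257)  # placeholder silence columns

class SpeechDataset(Dataset):
    def __init__(self, speech, silence):
        self.data = torch.cat([speech, silence])
        # Label 1 for speech columns, 0 for silence columns.
        self.labels = torch.cat([torch.ones(len(speech)),
                                 torch.zeros(len(silence))])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

loader = DataLoader(SpeechDataset(train_speech, train_silence),
                    batch_size=32, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # one batch of columns and their labels
```

`__getitem__` only has to return a single sample; the `DataLoader` takes care of shuffling and stacking samples into batches.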
What I don’t get is that my data is already a simple tensor… It doesn’t make sense to me that I need to create a separate abstraction just to fetch numbers from a few arrays…
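For what it’s worth, if the data is already a plain tensor, it looks like `torch.utils.data.TensorDataset` wraps existing tensors directly, with no custom class needed (shapes below are placeholders):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

train_speech = torch.randn(100, 257)   # placeholder speech columns
train_silence = torch.randn(120, 257)  # placeholder silence columns

data = torch.cat([train_speech, train_silence])
labels = torch.cat([torch.ones(100), torch.zeros(120)]).long()

# TensorDataset indexes all tensors along their first dimension.
loader = DataLoader(TensorDataset(data, labels),
                    batch_size=32, shuffle=True)
x, y = next(iter(loader))
```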