My task is to take an episode of a TV show together with its subtitles and make the subtitle timings more accurate (improving from roughly 200ms to 20ms). So I want to learn to classify what is speech and what is not.
I’ve now taken the audio, converted it into a spectrogram, and separated each column of the spectrogram into a single data item. So now I have two arrays:
```python
print(train_speech.size())   # torch.Size([93482, 201])
print(train_silence.size())  # torch.Size([35038, 201])
```
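For context, the preprocessing was roughly along these lines (just a sketch using torchaudio; the filename is a placeholder, and `n_fft=400` is what produces the 201 frequency bins shown above):

```python
import torchaudio

# Placeholder filename; in practice this is the episode's audio track.
waveform, sample_rate = torchaudio.load("episode.wav")

# n_fft=400 -> n_fft // 2 + 1 = 201 frequency bins per column.
spec = torchaudio.transforms.Spectrogram(n_fft=400)(waveform)  # (channels, 201, time)

# Transpose so each row is one spectrogram column: (time, 201).
frames = spec[0].T
```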
All I want to do is train a simple multi-layer linear NN to tell the difference between the two, where:
train_speech contains FFTs of frames where people are talking, and
train_silence contains frames with no talking (I used the subtitles to make the distinction).
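By a simple multi-layer linear NN I mean something like this sketch (the hidden width of 64 is just a guess on my part):

```python
import torch.nn as nn

# 201 FFT bins in, 2 classes out (speech vs. silence).
model = nn.Sequential(
    nn.Linear(201, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)
```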
My question is: what Dataset/DataLoader setup can I use to feed these two tensors into PyTorch?
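For concreteness, would something like the following work? (A minimal sketch assuming I can just concatenate the two tensors and attach 0/1 labels; the batch size is arbitrary.)

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stack both classes and label them: speech = 1, silence = 0.
features = torch.cat([train_speech, train_silence])               # (128520, 201)
labels = torch.cat([torch.ones(len(train_speech), dtype=torch.long),
                    torch.zeros(len(train_silence), dtype=torch.long)])

dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=256, shuffle=True)

for x, y in loader:   # x: (256, 201) spectrogram columns, y: (256,) labels
    pass              # feed each batch into the model
```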