How do I allow my neural network to train on an audio stream? In particular, what I want to achieve is to take in, for example, an audio clip that is 4 seconds long, slide an analysis window over it, classify the audio in each window, and return the class with the highest average probability across all windows. I’m not sure how to do this in PyTorch. Does this have something to do with the data loaders? Or is it more about the design of the network?
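At inference time, the sliding-window averaging you describe can be done with plain tensor ops, independently of the data loader. Here is a minimal sketch under the assumption that `model` is some classifier mapping a batch of fixed-length windows to per-class logits; the names (`classify_stream`, the 100 ms window size, the 5-class toy model) are illustrative, not from any particular library:

```python
import torch

def classify_stream(model, waveform, window_len, hop_len):
    """Classify a 1-D waveform by averaging window-level probabilities.

    `model` is assumed to map (num_windows, window_len) -> (num_windows, num_classes).
    """
    model.eval()
    # unfold produces overlapping views: (num_windows, window_len)
    windows = waveform.unfold(0, window_len, hop_len)
    with torch.no_grad():
        logits = model(windows)
        probs = torch.softmax(logits, dim=-1)
    # average probabilities over windows, then pick the top class
    return probs.mean(dim=0).argmax().item()

# toy usage: 4 s of 16 kHz audio, 100 ms windows with 50 ms hop,
# and a stand-in linear "classifier" with 5 classes
model = torch.nn.Linear(1600, 5)
audio = torch.randn(4 * 16000)
pred = classify_stream(model, audio, window_len=1600, hop_len=800)
```

For training, each window (or each whole clip) is just a normal sample; the averaging only matters at evaluation time.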
A classic solution in this case is simply to pad every clip to the longest length in your dataset (or batch) so the network always sees a fixed-size input.
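As a sketch of that padding approach, you could zero-pad per batch inside a DataLoader `collate_fn` using `torch.nn.utils.rnn.pad_sequence`; the function name `collate_pad` and the toy clip lengths are made up for illustration:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_pad(batch):
    """Zero-pad a list of variable-length 1-D waveforms to the batch max."""
    lengths = torch.tensor([len(x) for x in batch])
    padded = pad_sequence(batch, batch_first=True)  # (batch, max_len)
    return padded, lengths

# toy clips of 1 s, 3 s, and 4 s at 16 kHz
clips = [torch.randn(16000), torch.randn(48000), torch.randn(64000)]
padded, lengths = collate_pad(clips)
# padded.shape -> torch.Size([3, 64000])
```

Passing `collate_fn=collate_pad` to `torch.utils.data.DataLoader` would then give you padded batches; keeping the `lengths` tensor lets the model mask out the padding if needed.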
As I mentioned in another answer, @egy is right that padding to the longest length works, and you can see an example of the corresponding DataLoader setup in NVIDIA’s Tacotron 2 source code here.