I’m trying to build a Voice Activity Detection network in PyTorch using LSTMs.
As far as I understand, I’d want to use nn.LSTM rather than nn.LSTMCell, so that I can pass an entire audio signal at once and get one output at each time step indicating “speech” or “not-speech”.
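To make it concrete, here’s a minimal sketch of the kind of model I have in mind — the class name, feature size, and hidden size are placeholders I made up, not code from my actual project:

```python
import torch
import torch.nn as nn

class VADNet(nn.Module):
    """Per-timestep speech/non-speech classifier (hypothetical sketch)."""
    def __init__(self, n_features=40, hidden_size=64):
        super().__init__()
        # nn.LSTM consumes the whole sequence and returns an output per step
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)  # one logit per frame

    def forward(self, x):
        # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.fc(out).squeeze(-1)  # (batch, time) logits

model = VADNet()
frames = torch.randn(1, 100, 40)  # one "file" of 100 feature frames
logits = model(frames)
print(logits.shape)  # torch.Size([1, 100])
```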
My problem is that my dataset consists of several audio files. If I simply load and concatenate all of them, there will be places where the LSTM “sees” the end of one audio file followed by the beginning of another, which is problematic (the data loses its sequential meaning).
What I want is a way to make the LSTM run over the full sequence of one audio file, and when it reaches the end, start fresh at the beginning of the next file.
Is there any way to do this? Pointers and hints are very welcome (code is even better).
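One approach I’ve been considering — assuming this is the standard way to do it, please correct me if not — is to treat each file as its own sequence and batch them with padding and packing, so the hidden state never crosses a file boundary. The file lengths and feature dimension below are made-up placeholders:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Hypothetical per-file feature tensors of different lengths: (time, n_features)
files = [torch.randn(120, 40), torch.randn(80, 40), torch.randn(100, 40)]
lengths = torch.tensor([f.shape[0] for f in files])

# Pad to a common length, then pack so the LSTM ignores the padding
padded = pad_sequence(files, batch_first=True)  # (3, 120, 40)
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)

lstm = nn.LSTM(40, 64, batch_first=True)
out_packed, _ = lstm(packed)  # each file starts from a fresh hidden state
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)
print(out.shape)  # torch.Size([3, 120, 64])
```

Each file gets its own row in the batch, so there is no point where the LSTM carries state from the end of one file into the start of another.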
And one more thing: how does shuffling the data work for LSTM training?
If I just feed the same audio files in each epoch, won’t the network overfit the training data?
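What I imagine (and would like confirmed) is that you shuffle at the file level, so the order of files changes each epoch while the frames inside each file stay in order. A toy sketch with placeholder file names:

```python
import random

files = ["a.wav", "b.wav", "c.wav", "d.wav"]  # placeholder file names

for epoch in range(3):
    # New random file order each epoch; frames within a file stay sequential
    order = random.sample(range(len(files)), len(files))
    for i in order:
        pass  # train on files[i] as one intact sequence

print(sorted(order) == [0, 1, 2, 3])  # every epoch visits each file exactly once
```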