Constructing a sequential dataset from several audio files

Hi all,
I’m trying to build a Voice Activity Detection network in PyTorch using LSTMs.
As far as I understand, I’d want to use `LSTM` rather than `LSTMCell`, so that I could just pass the entire audio signal and get one value per step indicating “speech” or “not-speech”.
My problem is that my dataset is made up of several audio files. If I just load and concatenate all the audio files, there will be places where the LSTM “sees” the end of one audio file and the beginning of another, which is problematic (the data loses its sequential meaning).
What I want is a way for the LSTM to run over the full sequence of one audio file and then, when it reaches the end, start fresh at the beginning of the next file.
Is there any way to do this? Pointers and hints are very welcome (code even better 🙂).
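(For anyone landing here later, one common way to get this behavior is to treat each file as its own sequence, pad the batch, and use `pack_padded_sequence` so the LSTM's hidden state starts at zero for every file and padding is ignored. A minimal sketch with made-up toy data; the feature size of 13 and hidden size of 32 are arbitrary:)

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Hypothetical "files": variable-length feature sequences, shape (time, features),
# standing in for e.g. MFCC frames extracted from each audio file.
files = [torch.randn(n, 13) for n in (50, 80, 65)]

def collate(batch):
    # Sort by length, longest first (required by pack_padded_sequence
    # when enforce_sorted=True, the default).
    batch = sorted(batch, key=len, reverse=True)
    lengths = torch.tensor([len(x) for x in batch])
    padded = pad_sequence(batch, batch_first=True)  # (batch, max_time, features)
    return padded, lengths

lstm = nn.LSTM(input_size=13, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)  # one speech / not-speech logit per frame

padded, lengths = collate(files)
packed = pack_padded_sequence(padded, lengths, batch_first=True)
out, _ = lstm(packed)  # hidden state is reset to zero for each sequence in the batch
out, _ = pad_packed_sequence(out, batch_first=True)
logits = head(out).squeeze(-1)  # (batch, max_time) frame-level logits
```

Because each file is a separate item in the batch, the LSTM never carries state across a file boundary; the `collate` function can be passed as `collate_fn` to a `DataLoader` over a `Dataset` that returns one file per item.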

And one more thing: how does shuffling the data work for LSTM training?
If I just feed the same audio files in each epoch, won’t the network overfit the training data?


You probably want to decide first how you want to represent the audio (the raw signal itself, convolutions over it, spectrograms, MFCCs). To get variety between epochs you could augment the audio a bit: small pitch changes, added noise, volume changes, etc. I would look into how data is loaded for text classification / seq2seq models and then adapt your audio loading accordingly.
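A minimal sketch of the augmentation idea, using only plain tensor ops (random gain for volume changes plus additive Gaussian noise; pitch shifting would need something like `torchaudio`, which is not shown here). The function name and parameter values are illustrative, not from the thread:

```python
import torch

def augment(wave, noise_std=0.005, gain_range=(0.8, 1.2)):
    """Return a randomly perturbed copy of a 1-D waveform tensor."""
    # Random gain simulates a volume change.
    gain = torch.empty(1).uniform_(*gain_range)
    # Small additive Gaussian noise.
    noise = torch.randn_like(wave) * noise_std
    return wave * gain + noise

wave = torch.randn(16000)  # one second of fake audio at 16 kHz
aug = augment(wave)
```

Applying this inside the `Dataset`'s `__getitem__` means every epoch sees a slightly different version of each file.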