I have music data consisting of raw vocals from some Indian classical and devotional songs, and I have segmented each file into 20-second clips. I load each clip with librosa.load at a sample rate of 3000 Hz. The audio array it returns is an ndarray, which I convert to a PyTorch float tensor and reshape to (1, 60001). Each audio file is labeled, and I want to classify the clips with an LSTM.

With a batch size of 32, the input has shape (32, 1, 60001). What are the input size and the sequence length in this case? If I treat the audio as a single-channel input, is the input size 1 and the sequence length 60001? Each epoch ran faster when the input size was 60001, and the model ran very slowly when the input size was 1. Which is the correct way to set the input size and sequence length for the LSTM? Can an LSTM be trained on raw audio files for a classification task?
Yes, an LSTM can be trained on raw audio for classification.
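On the shape question, a small sketch may help. With batch_first=True, PyTorch's nn.LSTM reads its input as (batch, sequence length, input size). So a (32, 1, 60001) tensor is one time step with 60001 features, which is fast but gives the LSTM nothing to recur over, while (32, 60001, 1) is 60001 sequential steps of one sample each, which is why it runs so slowly. The framing variant below is only an assumption offered as a middle ground, not something from the question: the frame length of 300, the hidden size of 64, the 10 classes, and the names `describe` and `AudioLSTM` are all hypothetical.

```python
import torch
import torch.nn as nn

BATCH, N_SAMPLES = 32, 60001  # shapes taken from the question


def describe(x):
    """Return (sequence_length, input_size) as nn.LSTM(batch_first=True) reads x."""
    _, seq_len, input_size = x.shape
    return seq_len, input_size


# Random stand-in for a batch of loaded audio clips.
batch = torch.randn(BATCH, 1, N_SAMPLES)

# Interpretation A: tensor as-is -> 1 time step, 60001 features per step.
shape_a = describe(batch)

# Interpretation B: one audio sample per time step -> 60001 steps, 1 feature.
per_sample = batch.transpose(1, 2)
shape_b = describe(per_sample)

# Interpretation C (assumption): cut the waveform into frames of 300 samples.
FRAME = 300                      # hypothetical frame length
n_frames = N_SAMPLES // FRAME    # 200 frames; the one leftover sample is dropped
framed = batch[:, 0, : n_frames * FRAME].reshape(BATCH, n_frames, FRAME)


class AudioLSTM(nn.Module):
    """Hypothetical classifier: LSTM over frames, linear layer on the last hidden state."""

    def __init__(self, input_size, hidden_size=64, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers, batch, hidden_size)
        return self.fc(h_n[-1])      # logits: (batch, num_classes)


model = AudioLSTM(input_size=FRAME)
logits = model(framed)

print("A (seq_len, input_size):", shape_a)
print("B (seq_len, input_size):", shape_b)
print("C framed input:", tuple(framed.shape), "-> logits:", tuple(logits.shape))
```

Under these assumptions the LSTM sees 200 time steps of 300 values each, so it still models the clip as a sequence without stepping through every individual sample.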