The right way to perform sequence classification with RNN using a sliding window?

Hello,
I want to use an RNN (specifically LSTM) in order to classify a sequence.
The input is a simple sequence of 1000 data points (think audio signal). I want to classify the whole sequence, and I want to do it with a “sliding window” approach.

An example input has size [64,1000], where 64 is the minibatch size and 1000 is the sequence length (64 sequences, each 1000 data points long).

Previously I tried looping manually in the forward function, sending the LSTM one part of the sequence at a time.
I did it without a sliding window, so I just split each sequence into 10 non-overlapping parts of 100 data points and sent them to the LSTM one by one.
I also added a leading dimension so that each part could be fed to the LSTM.
So here the input to the LSTM was [1,64,100]. On each iteration I took the hidden state from the previous iteration and fed it to the next LSTM call.
I was able to overfit the network on a few samples (to check for bugs), but it did not converge on the full problem.
I think there is a flaw in my design because I feed the LSTM one “part” of my sequence at a time, whereas I believe I should have fed all the parts at once, stacked along the first dimension.
I was also thinking of a sliding window (where the LSTM would evaluate the sequence through overlapping windows instead of non-overlapping parts one at a time).
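For reference, here is a minimal sketch of that first design next to the "all parts at once" variant. The hidden size of 32 and the random input are placeholders I picked for illustration, not values from my real setup. As far as I can tell, carrying the hidden state across a manual loop and passing all 10 parts stacked along the first dimension should produce the same forward result, so the loop itself may not be the flaw:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Sizes from the description above; hidden_size is an arbitrary placeholder.
batch, seq_len, chunk_len, hidden_size = 64, 1000, 100, 32

lstm = nn.LSTM(chunk_len, hidden_size)  # expects (seq_len, batch, input_size)
x = torch.randn(batch, seq_len)         # stand-in for a real minibatch

# Design 1: manual loop, one chunk per call, carrying the (h, c) state.
state = None  # None lets nn.LSTM start from a zero state
for chunk in x.split(chunk_len, dim=1):  # 10 chunks of shape [64, 100]
    step = chunk.unsqueeze(0)            # [1, 64, 100]
    out_loop, state = lstm(step, state)

# Design 2: stack all chunks along the first dimension and call once.
stacked = torch.stack(x.split(chunk_len, dim=1), dim=0)  # [10, 64, 100]
out_all, _ = lstm(stacked)

# The last step of the single call should match the last loop output.
same = torch.allclose(out_loop[0], out_all[-1], atol=1e-5)
```

If `same` is true, the two designs differ only in convenience and speed, and the convergence problem has to come from somewhere else (model capacity, learning rate, and so on).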

Now I would like to know whether you can suggest a better solution to my problem.
First, I transformed the input.
Let's look at a simple example:
minibatch: 2 sequence_length: 11 sliding_window_width: 5 sliding_window_stride: 2
The input would be [2,11] and I want to transform it to [4,2,5].
(A short explanation: for each sequence (of length 11) I want to create “an array” that consists of the “sliding windows” of that sequence.)
A sequence [0,1,2,3,4,5,6,7,8,9,10]
would transform to [[0,1,2,3,4],[2,3,4,5,6],[4,5,6,7,8],[6,7,8,9,10]]
I did this with the following loop (it runs in the forward function, on each minibatch):

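import numpy as np
import torch

input = torch.from_numpy(np.array([[0,1,2,3,4,5,6,7,8,9,10],[10,11,12,13,14,15,16,17,18,19,20]]))
splitInput = torch.zeros(4, 2, 5)  # (num_windows, batch, window_width)
for i in range(4):
    splitInput[i] = input[:, i*2:(i*2+5)]  # window i starts at i*stride, with stride = 2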

So now I have (what I think is) the correct format for the LSTM input, which is
"(seq_len, batch, input_size)"
which is (as I understand it):
(“number of time steps / number of sliding windows”, batch, “width of the sliding window”)
Am I correct? Is there a better way to perform this transform? Maybe with np.hstack, but I couldn’t find a way.
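One loop-free candidate for this transform is `Tensor.unfold(dimension, size, step)`. Below is a minimal sketch on the toy example above; the variable names are mine, and the final `.float()` is only there because `nn.LSTM` expects floating-point input:

```python
import torch

# Toy batch from the example: 2 sequences of 11 points each.
x = torch.arange(11).repeat(2, 1)  # shape [2, 11], both rows are 0..10
x[1] += 10                         # second row becomes 10..20

width, stride = 5, 2  # sliding window parameters from the example

# unfold along the sequence dimension -> [batch, num_windows, width]
windows = x.unfold(1, width, stride)

# Move the window index to the front -> [num_windows, batch, width]
lstm_input = windows.permute(1, 0, 2).contiguous().float()
```

Here `lstm_input[:, 0]` holds the four windows of the first sequence, in the same layout the loop produces.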

I tried feeding it into an LSTM; the output is [“num of sliding windows”, batch, hidden_size].
On a side note, the LSTM required me to initialize it with nn.LSTM(width_of_sliding_window, hidden_size, num_of_layers), whereas in the previous design I could pass whatever I wanted as the first argument.
After the LSTM I feed the output (of size [“num of sliding windows”, batch, hidden_size]) to a Linear layer defined as "nn.Linear(num_of_sliding_windows*hidden_size, num_classes)". The output has to be reshaped before it can go into the linear layer, so I apply "output.transpose(0,1).contiguous().view(batch,-1)".
This transform results in a tensor of size [batch, num_of_sliding_windows*hidden_size].
After the linear layer I’m left with an output of size [batch, num_classes].
Then I used cross entropy as the loss function.
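To make the shapes concrete, here is a minimal sketch of the pipeline described above. The class name `WindowedLSTMClassifier`, the sizes (hidden size 16, 3 classes), and the random input and labels are placeholders for illustration only:

```python
import torch
import torch.nn as nn

class WindowedLSTMClassifier(nn.Module):
    """LSTM over sliding windows, then one Linear layer over all outputs."""

    def __init__(self, window_width, hidden_size, num_layers, num_windows, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(window_width, hidden_size, num_layers)
        self.fc = nn.Linear(num_windows * hidden_size, num_classes)

    def forward(self, x):
        # x: [num_windows, batch, window_width]
        out, _ = self.lstm(x)  # [num_windows, batch, hidden_size]
        batch = out.size(1)
        # Put batch first, then flatten windows and hidden units together.
        flat = out.transpose(0, 1).contiguous().view(batch, -1)
        return self.fc(flat)   # [batch, num_classes]

torch.manual_seed(0)
model = WindowedLSTMClassifier(window_width=5, hidden_size=16, num_layers=1,
                               num_windows=4, num_classes=3)
x = torch.randn(4, 2, 5)         # stand-in for the windowed toy batch
labels = torch.tensor([0, 2])    # made-up class indices, one per sample
logits = model(x)                # [2, 3]
loss = nn.CrossEntropyLoss()(logits, labels)
```

Note that `nn.CrossEntropyLoss` takes the raw output of the linear layer, so no softmax is applied in `forward`.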
I tried over-fitting the network on a few samples and expected to reach a minimal loss, because the network would “remember” the samples. Unfortunately I was not able to over-fit, which is probably because there is a bug in my design. Can you spot a problem with the design? Maybe some of the later transforms mixed the samples? Maybe the design itself is flawed? Maybe I don’t understand the input the LSTM actually requires, and that is why it doesn’t work?
EDIT: actually I was able to over-fit, I just increased num_hidden… Still, any suggestions would be appreciated.
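One way to test the "did the transforms mix the samples?" question is a gradient check: backpropagate from the logits of one sample only and see whether any gradient reaches the other sample's input. This is a minimal sketch with placeholder sizes and random data, not my real model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy sizes; hidden_size and num_classes are arbitrary placeholders.
num_windows, batch, width, hidden_size, num_classes = 4, 2, 5, 8, 3

lstm = nn.LSTM(width, hidden_size, 1)
fc = nn.Linear(num_windows * hidden_size, num_classes)

x = torch.randn(num_windows, batch, width, requires_grad=True)
out, _ = lstm(x)                                          # [4, 2, 8]
flat = out.transpose(0, 1).contiguous().view(batch, -1)   # [2, 32]
logits = fc(flat)                                         # [2, 3]

# Backpropagate from sample 0's logits only.
logits[0].sum().backward()

# Gradient that reached each sample's own input windows.
grad_own = x.grad[:, 0].abs().sum().item()    # should be non-zero
grad_other = x.grad[:, 1].abs().sum().item()  # should be zero if nothing is mixed
```

With the `transpose(0,1)` in place, `grad_other` comes out as zero, so the samples stay separate. Calling `out.view(batch, -1)` directly, without the transpose, would interleave windows of different samples in each row, and the same check would then report a non-zero `grad_other`.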
Any help with the current design will be appreciated.

Also, if you know of or can think of a better design for this type of problem, I would be really happy to hear it.

Thanks.

P.S. I’m relatively new to PyTorch and DL, so please don’t be mad if I understood something wrong, or very very wrong… I would appreciate any explanations or links to explanations.

P.P.S. I also searched for examples of pytorch/rnn/sequence, and most of them were either for video or for text (which is very different from a simple sequence like audio). I tried to implement what I understood from those examples in the first design I described.