Shape of input data to an LSTM with very long sequences


I’m new to PyTorch and I’m having trouble understanding how my LSTM behaves with different input shapes for my data.

My data consists of signals, where each sequence has a length of roughly 300 000 time steps. The lengths vary between 5 000 and 500 000, but most are around 300 000. I’m trying to solve a many-to-one task, so each sequence belongs to exactly one class. In total I have 65 000 different sequences (each around 300 000 time steps long).

Sequence 1 (length: 300 000) belongs to class 0
Sequence 2 (length: 310 000) belongs to class 1
Sequence 3 (length: 290 000) belongs to class 2
Sequence 4 (length: 280 000) belongs to class 0

I ran my LSTM with two different input shapes, with batch_first=True:

(1, sequence, 1): Here I had to limit the sequence length to 500 to be able to train the network. I’m planning to use batches here, where one batch contains the whole sequence (600, 500, 1). I assume the network receives one value per time step, but the training curve does not decrease.
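A minimal sketch of what I believe this first setup looks like (hidden size and layer names are illustrative, not from my actual model). With batch_first=True and shape (batch, seq_len, 1), the LSTM runs its recurrence over 500 time steps, reading one scalar feature per step:

```python
import torch
import torch.nn as nn

# Illustrative sizes: hidden_size=64 is an assumption, not my real model.
lstm = nn.LSTM(input_size=1, hidden_size=64, batch_first=True)

# One long sequence split into 600 chunks of 500 time steps, 1 feature each.
x = torch.randn(600, 500, 1)

out, (h_n, c_n) = lstm(x)
print(out.shape)  # (600, 500, 64): one hidden state per time step and chunk
print(h_n.shape)  # (1, 600, 64): final hidden state of each chunk
```

So here the recurrence actually unrolls over 500 steps, which is why training is slow and why I had to truncate.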

(1, 1, sequence):
Here I padded or cut the sequences to a common length of 250 000.
I assume the network receives the whole sequence at once; here the training curve does decrease.
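If I understand the shape convention correctly, this second setup treats the entire sequence as a single time step with 250 000 features, so the recurrence only runs once. A sketch (using a shorter length so it runs quickly; 1 000 stands in for 250 000):

```python
import torch
import torch.nn as nn

seq_len = 1000  # stands in for my padded length of 250 000

# input_size must equal the last dimension of the input tensor.
lstm = nn.LSTM(input_size=seq_len, hidden_size=64, batch_first=True)

# Shape (batch=1, time steps=1, features=seq_len):
# the whole sequence is presented as ONE time step.
x = torch.randn(1, 1, seq_len)

out, (h_n, c_n) = lstm(x)
print(out.shape)  # (1, 1, 64): only a single recurrent step happens
```

If that reading is right, the LSTM in this configuration isn’t really using its recurrence at all, which might explain why it trains faster.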

Both implementations work and produce some results (not good ones, but results).

Could someone explain to me how the network processes the data differently for these two input shapes?
Do both make sense?

Thanks for any help