Which Is the Best Way to Create Training Sequences for LSTM-Based Class Prediction on Time-Series Data?

Let’s say I have time-series data in the following format. I need to create training sequences of a fixed length as input to my LSTM model in PyTorch.

# Time, Bitrate, Class
  0.2,  312,     1
  0.3,  319,     1
  0.5,  227,     0
  0.6,  229,     0
  0.7,  219,     0
  0.8,  341,     1
  1.0,  401,     2

Once trained, the model should be able to predict the classes of unlabeled test data. I am not sure how best to create the training sequences. Let’s say I define sequence_length = 3 for the data snippet above. I see two approaches:

  1. By sliding a window of sequence_length over the data. In this case, the sequences are:
# Sequence,                    Label of the Sequence
  [(312,1), (319,1), (227,0)]  0
  [(319,1), (227,0), (229,0)]  0
  [(227,0), (229,0), (219,0)]  0
  [(229,0), (219,0), (341,1)]  1
  [(219,0), (341,1), (401,2)]  2

The label of each sequence is the class of its last element, so the label of the first sequence is 0. However, this seems problematic: even though the last element’s class is 0, the first two elements belong to class 1. (A rough sketch of how I build these windows is below.)
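This is roughly the code I have in mind for the sliding-window version (just a sketch, untested; the function name and shapes are only how I imagine it):

# Approach 1 sketch: slide a window of sequence_length over the rows.
# `data` is a list of (bitrate, class) tuples in time order.
import torch

def make_sliding_windows(data, sequence_length=3):
    sequences, labels = [], []
    for i in range(len(data) - sequence_length + 1):
        window = data[i:i + sequence_length]
        sequences.append([list(pair) for pair in window])
        labels.append(window[-1][1])   # label = class of the last element
    # shapes: (num_windows, sequence_length, 2) and (num_windows,)
    return torch.tensor(sequences, dtype=torch.float32), torch.tensor(labels)

data = [(312, 1), (319, 1), (227, 0), (229, 0), (219, 0), (341, 1), (401, 2)]
X, y = make_sliding_windows(data)
print(X.shape, y.tolist())  # torch.Size([5, 3, 2]) [0, 0, 0, 1, 2]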

  2. Fetching the data class by class, up to sequence_length elements at a time. If a run has fewer elements than sequence_length, I pad with its last element (since all tensors must have the same shape); a sketch of this is below the example:
# Sequence,                   Label of the Sequence
 [(312,1), (319,1), (319,1)]  1   <- padded with the last element (319,1)
 [(227,0), (229,0), (219,0)]  0
 [(341,1), (341,1), (341,1)]  1   <- padded with the last element (341,1)
 [(401,2), (401,2), (401,2)]  2   <- padded with the last element (401,2)
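And this is roughly how I would build the class-by-class sequences (again just a sketch, untested; itertools.groupby is only my guess at a convenient way to get the consecutive same-class runs):

# Approach 2 sketch: group consecutive rows with the same class, cut each run
# into chunks of sequence_length, and right-pad short chunks with their last element.
from itertools import groupby
import torch

def make_class_runs(data, sequence_length=3):
    sequences, labels = [], []
    for cls, run in groupby(data, key=lambda pair: pair[1]):
        run = list(run)
        for i in range(0, len(run), sequence_length):
            chunk = run[i:i + sequence_length]
            chunk = chunk + [chunk[-1]] * (sequence_length - len(chunk))  # pad with last element
            sequences.append([list(pair) for pair in chunk])
            labels.append(cls)
    return torch.tensor(sequences, dtype=torch.float32), torch.tensor(labels)

data = [(312, 1), (319, 1), (227, 0), (229, 0), (219, 0), (341, 1), (401, 2)]
X, y = make_class_runs(data)
print(X.shape, y.tolist())  # torch.Size([4, 3, 2]) [1, 0, 1, 2]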

Instead of right-padding with the last element, would zero-padding from the left make more sense? If so, do I need to do anything to make the LSTM ignore those padded zeros?
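To make that question concrete, this is what I am imagining for the zero-padding variant. As far as I understand, torch.nn.utils.rnn.pack_padded_sequence expects the padding on the right, so left zero-padding and packing look like alternatives rather than something to combine, but I may be wrong about that:

# Two options for a run shorter than sequence_length, e.g. [(341, 1)]:
#   (a) left-pad with zeros and just read the LSTM output at the last time step, or
#   (b) right-pad, keep the true length, and use pack_padded_sequence so the
#       padded steps are skipped.
import torch
from torch.nn.utils.rnn import pack_padded_sequence

sequence_length = 3
chunk = [(341, 1)]
pad = [(0, 0)] * (sequence_length - len(chunk))

left_padded = torch.tensor([pad + chunk], dtype=torch.float32)   # (1, 3, 2), zeros first
right_padded = torch.tensor([chunk + pad], dtype=torch.float32)  # (1, 3, 2), zeros last
lengths = torch.tensor([len(chunk)])

# Option (b): feed this to an nn.LSTM(..., batch_first=True)
packed = pack_padded_sequence(right_padded, lengths, batch_first=True, enforce_sorted=False)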

Thanks!