Hi PyTorch forum! I am not very familiar with LSTM, so please bear with me on this.

Suppose we are working with sequential data of variable input length and a batch size of two.

e.g.

[1, 1]

[1, 1, 1, 1, 1]

Now I pad these inputs to be of the same length, resulting in

[1, 1, 0, 0, 0]

[1, 1, 1, 1, 1]
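For reference, this kind of right-padding can be done with PyTorch's built-in `pad_sequence` (a minimal sketch of the setup above, with the same lengths 2 and 5):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two variable-length sequences, matching the example above.
seqs = [torch.ones(2), torch.ones(5)]

# pad_sequence right-pads every sequence with 0 up to the longest length.
padded = pad_sequence(seqs, batch_first=True)   # shape: (2, 5)

# Keeping the true lengths around is useful later for masking the loss.
lengths = torch.tensor([len(s) for s in seqs])  # tensor([2, 5])
```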

I need to write a custom loss function, and the way I wrote it, it produces a loss of 0 whenever it encounters a padding element (0 in this case). So the per-element loss for one batch would look like this:

[2, 3, 0, 0, 0]

[4, 1, 2, 6, 7] (all non-zero numbers are made up)

I then sum everything together and take the average by dividing by the batch size, which is 2. The model is updated using this loss value, so it is only updated after all the sequences in the batch have been processed in full.
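In code, the averaging scheme I described would look something like this (the per-element losses are the made-up numbers from above; padding positions already contribute 0):

```python
import torch

# Made-up per-element losses; padding positions are already 0.
loss_per_elem = torch.tensor([[2., 3., 0., 0., 0.],
                              [4., 1., 2., 6., 7.]])

batch_size = loss_per_elem.size(0)  # 2

# Sum every element and divide by the batch size.
loss = loss_per_elem.sum() / batch_size  # (5 + 20) / 2 = 12.5
```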

It suddenly occurred to me that an input with a shorter length will drag down the loss, because a shorter sequence will naturally have a smaller total loss in most cases. So the average loss of this batch may not be very reflective of the actual performance.
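To make the concern concrete, here is a sketch comparing the batch-size normalization above with an alternative that divides by the number of real (non-pad) elements instead. The explicit mask and the per-token mean are my own additions, not part of the original setup:

```python
import torch

# Same made-up per-element losses as above.
loss_per_elem = torch.tensor([[2., 3., 0., 0., 0.],
                              [4., 1., 2., 6., 7.]])

# 1 marks a real element, 0 marks padding.
mask = torch.tensor([[1., 1., 0., 0., 0.],
                     [1., 1., 1., 1., 1.]])

# Dividing by the batch size: the short sequence's small sum pulls the mean down.
per_batch_mean = loss_per_elem.sum() / loss_per_elem.size(0)  # 25 / 2 = 12.5

# Dividing by the number of non-pad elements weights every real token equally.
per_token_mean = (loss_per_elem * mask).sum() / mask.sum()    # 25 / 7
```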

Is this a theoretically sound way to do it? Is there a better approach for handling this variable-length input scenario? Thank you so much!