Each row of my data doesn’t have the same size.

Ideally, the shape of the input data would be `(batch_size, N, dim)`.

But the rows in a batch don’t all have the same dimension; a single row can be `(k, dim)` with `k < N`.

To feed the data to my model, I have to add some fake rows, called padded rows.

These inputs pass through several functions and layers in my model.

In the end, the loss function takes the mean over each row as input → `(batch_size, dim)`. But I don’t want the padded rows to be included in that reduction.

→ So I computed the mean as:

- assign all the padded rows to zero;
- `torch.sum(..) / number_of_non_padded_rows`
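A minimal sketch of that masked mean, assuming the zeroing is done by multiplying with a 0/1 mask (tensor names like `x` and `mask` are illustrative, not from my actual code):

```python
import torch

# x: padded batch of shape (batch_size, N, dim)
# mask: (batch_size, N), 1.0 for real rows, 0.0 for padded rows
batch_size, N, dim = 2, 4, 3
x = torch.randn(batch_size, N, dim, requires_grad=True)
mask = torch.tensor([[1.0, 1.0, 0.0, 0.0],
                     [1.0, 1.0, 1.0, 0.0]])

# Zero out the padded rows, then divide by the count of real rows per example.
x_masked = x * mask.unsqueeze(-1)        # (batch_size, N, dim)
n_real = mask.sum(dim=1, keepdim=True)   # (batch_size, 1)
mean = x_masked.sum(dim=1) / n_real      # (batch_size, dim)
```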

My question is: since the padded rows are included in the input to `torch.sum(..)`, does the model try to modify its weights based on these fake (padded) rows?

My second assumption is that because I assigned the padded rows a constant value of zero, the gradients for these padded rows are somehow eliminated during training?
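One small experiment to inspect this directly, again assuming the padded rows are zeroed by multiplying with a mask (tensor names are illustrative): backpropagate through the masked mean and look at the gradient at the padded positions.

```python
import torch

x = torch.randn(1, 4, 3, requires_grad=True)
mask = torch.tensor([[1.0, 1.0, 0.0, 0.0]])  # last two rows are padding

# Masked mean as above: zero the padded rows, divide by the real-row count.
x_masked = x * mask.unsqueeze(-1)
mean = x_masked.sum(dim=1) / mask.sum(dim=1, keepdim=True)
mean.sum().backward()

# The multiplication by the mask zeroes the gradient flowing back into the
# padded rows, while the real rows each receive gradient 1 / n_real = 0.5.
print(x.grad[0, 2:])  # gradients at the padded rows: all zeros
print(x.grad[0, :2])  # gradients at the real rows: all 0.5
```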