I’m trying to calculate MSELoss when a mask is used. Suppose that I have a target tensor of shape `[2, 33, 1]` (batch size 2), and an input tensor with the same shape. Since the sequence length might differ for each example, I also have a binary mask indicating which elements of the input sequence actually exist. So here is what I’m doing:

```
import torch.nn as nn

mse_loss = nn.MSELoss(reduction='none')  # keep per-element squared errors
loss = mse_loss(input, target)
loss = (loss * mask.float()).sum()  # sum of squared errors over unmasked elements
mse_loss_val = loss / loss.numel()
# now doing backpropagation
mse_loss_val.backward()
```

Is `loss / loss.numel()` a good practice? I’m skeptical, as I have to use `reduction='none'`, and when calculating the final loss value I think I should average only over the loss elements that are unmasked. However, I’m taking the average over all tensor elements with `torch.numel()`. I’m actually trying to take the `1/n` factor of MSELoss into account. Any thoughts?
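
To make the alternative concrete, here is a minimal sketch of the normalization I suspect is more appropriate: dividing the masked sum by the number of unmasked elements, so that `n` in the `1/n` factor counts only real positions. The helper name `masked_mse`, the sequence lengths 33 and 20, and the random tensors are assumptions for illustration only. (As a side note, after `.sum()` the `loss` tensor is 0-dimensional, so `loss.numel()` evaluates to 1 rather than to the number of elements in the original tensor.)

```python
import torch
import torch.nn.functional as F


def masked_mse(pred, target, mask):
    """Mean squared error over positions where mask is 1 (hypothetical helper)."""
    mask = mask.float()
    per_elem = F.mse_loss(pred, target, reduction='none')
    # Divide by the count of unmasked elements; clamp guards against an all-zero mask.
    return (per_elem * mask).sum() / mask.sum().clamp(min=1.0)


torch.manual_seed(0)
pred = torch.randn(2, 33, 1, requires_grad=True)
target = torch.randn(2, 33, 1)

# Assumed sequence lengths for the two examples in the batch.
lengths = torch.tensor([33, 20])
# mask[b, t, 0] is True when position t is a real element of sequence b.
mask = (torch.arange(33)[None, :] < lengths[:, None]).unsqueeze(-1)  # shape [2, 33, 1]

loss = masked_mse(pred, target, mask)
loss.backward()
```

With this normalization the result should match the plain mean of the squared errors taken over the unmasked positions, and padded positions receive zero gradient because their contribution is multiplied by zero.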