Consider using BatchNorm1d on batched, multi-channel data where the convolutional axis is time:

If the batch size is one, the shape of the input tensor is [1, C, T] and normalization proceeds as appropriate.

If the batch size is (say) eight, the shape of the input tensor is [8, C, T] and normalization proceeds under the assumption that all of the inputs are the same size and that every value of the input tensor is valid and contributes to the statistics of the batch.
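To make that assumption concrete, here is a minimal sketch (the sizes N, C, T and the random data are arbitrary choices for illustration) showing that BatchNorm1d in training mode computes one mean and one biased variance per channel, pooled over every batch entry and every time step:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, C, T = 8, 3, 50                      # illustrative sizes, not from a real dataset
x = torch.randn(N, C, T) * 2.0 + 5.0    # synthetic data with non-trivial mean and variance

# affine=False so the output is the pure normalization, with no learned scale/shift
bn = nn.BatchNorm1d(C, affine=False)
bn.train()                              # training mode: use statistics of the current batch
y = bn(x)

# Manual equivalent: per-channel statistics pooled over the batch and time axes
mean = x.mean(dim=(0, 2), keepdim=True)
var = x.var(dim=(0, 2), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + bn.eps)

matches = torch.allclose(y, y_manual, atol=1e-5)
print(matches)
```

Every one of the N * T values in a channel carries equal weight in those statistics, which is exactly what goes wrong once some of them are padding.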

But this is not always a good assumption with temporal data. We might have 8 sound files of different lengths, so that the input tensor is better described as [8, C, T_max], where T_max is the T of the longest sample, and the shorter batch entries are padded/masked somehow.

Here is the rub: I cannot think of an appropriate padding/masking scheme. If one masks with 0, the normalization will assume that is actual data, which it is not. np.nan and np.inf are sticky and will convert the entire output to np.nan or np.inf, which is obviously not desirable.
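Both failure modes are easy to reproduce. The sketch below uses made-up lengths (100 and 40) and synthetic data with a true mean near 5; the variable names are hypothetical and only serve the illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C, T_max = 2, 100
lengths = torch.tensor([100, 40])            # hypothetical per-sample lengths
x = torch.randn(2, C, T_max) + 5.0           # synthetic data, true mean about 5

# mask[n, 0, t] is True where sample n has real data at time t
mask = torch.arange(T_max)[None, None, :] < lengths[:, None, None]
valid = mask.expand_as(x)

# Zero padding: the zeros are counted as data and drag the channel mean down
x_zero = x * mask
padded_mean = x_zero.mean(dim=(0, 2))
true_mean = torch.stack([x[:, c][valid[:, c]].mean() for c in range(C)])
print(true_mean, padded_mean)

# NaN padding: the batch statistics become NaN, and so does the whole output
x_nan = x.masked_fill(~mask, float("nan"))
bn = nn.BatchNorm1d(C, affine=False).train()
y_nan = bn(x_nan)
all_nan = bool(torch.isnan(y_nan).all())
print(all_nan)
```

With 60 of the 200 positions per channel padded, the zero-padded mean lands near 3.5 instead of 5, and the NaN-padded output contains no usable values at all.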

Is there some way to convey to a normalizer that some data is masked and should not be considered for normalization, in a way that keeps batch parallelization and the speedups on a GPU?
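For concreteness, the behavior being asked for might look like the following hand-written sketch. It is not a built-in option of BatchNorm1d; the function name `masked_batch_norm` and the [N, 1, T] boolean mask layout are assumptions made for this illustration, and it omits running statistics and the learned affine parameters:

```python
import torch

def masked_batch_norm(x, mask, eps=1e-5):
    """Hypothetical helper: normalize x [N, C, T] per channel using only
    positions where mask [N, 1, T] is True. No running stats, no affine."""
    m = mask.to(x.dtype)
    # Number of valid (batch, time) positions; same for every channel
    count = m.sum(dim=(0, 2), keepdim=True).clamp(min=1.0)
    # Per-channel mean and biased variance over valid positions only
    mean = (x * m).sum(dim=(0, 2), keepdim=True) / count
    var = (((x - mean) ** 2) * m).sum(dim=(0, 2), keepdim=True) / count
    y = (x - mean) / torch.sqrt(var + eps)
    return y * m                          # zero out the padded positions

torch.manual_seed(0)
N, C, T_max = 4, 3, 64
lengths = torch.tensor([64, 50, 30, 10])  # hypothetical per-sample lengths
x = torch.randn(N, C, T_max) * 3.0 + 2.0
mask = torch.arange(T_max)[None, None, :] < lengths[:, None, None]

y = masked_batch_norm(x * mask, mask)

# Check: over valid positions, each channel has roughly zero mean and unit variance
for c in range(C):
    vals = y[:, c][mask[:, 0]]
    print(float(vals.mean()), float(vals.var(unbiased=False)))
```

Everything here is expressed as dense tensor operations over the padded [N, C, T_max] tensor, so it stays batch-parallel; whether it is as fast as the fused BatchNorm1d kernel is a separate question that this sketch does not settle.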

(Note: This is not a recurrent neural network application, it is a convolutional application. I am really looking for a way to stay in the convolutional format.)