Batch Normalization Over Sequence Does Not Generalize to Validation

I have videos of variable lengths, where each frame is represented as a feature vector I create of size 1000.
With a batch size of 16, my tensors are in the shape (16, 1000, LEN) - where LEN is the maximum length of a video in this batch.

If I instantiate a batch norm considering the 1000 dimensions as different channels:

self.batch_norm = nn.BatchNorm1d(1000)

and run self.batch_norm(tensor) in the forward pass, my training_loss goes down, but my validation_loss stays stagnant, which leaves me at 0% accuracy on my task.

Without the batch norm, my model trains well: validation_loss goes down, and accuracy is about 65%.

Perhaps this problem comes from the padded sequences: if I batch a video of 10 frames with a video of 20 frames, I create 10 frames of zeros.

To check that, instead of padding with zeros, I tried padding with the last frame of that video, so it is a valid frame. However, the results are the same.
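
One way to check the padding hypothesis directly is to compare per-channel statistics computed over all frames against statistics computed over valid frames only. The sketch below is not code from the model above: the helper masked_channel_stats and the lengths tensor (number of real frames per video) are hypothetical names introduced for illustration, and a tiny 3-channel tensor stands in for the 1000-dimensional features.

```python
import torch

def masked_channel_stats(x, lengths):
    """Per-channel mean and (biased) variance over valid frames only.

    x:       (batch, channels, max_len) padded tensor
    lengths: (batch,) number of real frames per sample, on the same device as x
    Hypothetical helper for debugging, not part of torch.nn.
    """
    max_len = x.size(2)
    # mask[b, 0, t] is 1.0 where frame t of sample b is a real frame
    positions = torch.arange(max_len, device=x.device)
    mask = (positions[None, :] < lengths[:, None]).unsqueeze(1).to(x.dtype)
    count = mask.sum()  # total number of valid frames in the batch
    mean = (x * mask).sum(dim=(0, 2)) / count
    var = (((x - mean[None, :, None]) ** 2) * mask).sum(dim=(0, 2)) / count
    return mean, var

# Toy batch: a 10-frame and a 20-frame video, all real frames equal to 1,
# with the shorter video zero-padded up to 20 frames.
x = torch.zeros(2, 3, 20)
lengths = torch.tensor([10, 20])
x[0, :, :10] = 1.0
x[1, :, :] = 1.0

naive_mean = x.mean(dim=(0, 2))                    # includes the padded zeros
masked_mean, masked_var = masked_channel_stats(x, lengths)
```

In this toy batch the naive mean is 0.75 per channel while the masked mean is 1, so the gap between the two gives a rough measure of how much padding shifts the statistics that a batch norm layer would see.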

The batch size might be too small. If the batches come from different distributions, the batchnorm layers might estimate noisy running stats.
Would it be possible to increase the batch size and lower the sequence length for the sake of debugging?
Alternatively, you could lower the momentum so that each batch changes the running stats a bit less.
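
As a minimal sketch of the momentum suggestion: in PyTorch the running stats are updated as running = (1 - momentum) * running + momentum * batch_stat, with a default momentum of 0.1, so a smaller value lets each batch move the running estimate less. The random batches below are stand-in data, and the value 0.01 is only an example, not a recommended setting.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

NUM_FEATURES = 1000  # feature size from the question

bn_default = nn.BatchNorm1d(NUM_FEATURES)              # momentum=0.1 (default)
bn_slow = nn.BatchNorm1d(NUM_FEATURES, momentum=0.01)  # example value only

# Stand-in batches of shape (16, 1000, LEN) with a different LEN per batch.
# Both layers are in training mode, so each call updates the running stats.
for length in (10, 20, 15, 30):
    x = torch.randn(16, NUM_FEATURES, length) * 2.0 + 3.0
    bn_default(x)
    bn_slow(x)

# How far the running mean has moved from its initial value of 0
drift_default = bn_default.running_mean.abs().mean().item()
drift_slow = bn_slow.running_mean.abs().mean().item()
```

After the same four batches, drift_slow is smaller than drift_default. A lower momentum smooths out batch-to-batch noise, but the running stats also need more iterations to converge, so validation results early in training may look worse.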

Thanks @ptrblck
Playing with the momentum did help the model train correctly. However, after sweeping through a large range of momentum values, the best model on the validation set is still the one without batch norm at all.

Thanks again, consider this solved.