RuntimeError: running_mean should contain 32 elements not 1024?

I get the error described above when attempting to apply batch norm in my network.

The isolated code for the section that fails to run is:

    print(layer_f)
    print(layer_b)
    x = self.af(layer_f(x))   # linear layer followed by the activation function
    print(x.shape)
    x = layer_b(x)            # batch norm layer: this line raises the error

The printout for these objects is:

Linear(in_features=84, out_features=1024, bias=True)
BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
torch.Size([11, 32, 1024])

From what I understand, I am passing my [11, 32, 84] tensor into the linear layer to get an [11, 32, 1024] tensor, then pushing that tensor through a batch norm layer of size 1024. I don't understand what has gone wrong here.

The input dimensions are [batch, channel, element], and the number of channels changes for each input.
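For reference, a minimal standalone snippet that reproduces the error (assuming self.af is just ReLU here, which isn't necessarily the case in my real model) looks like this:

    import torch
    import torch.nn as nn

    x = torch.randn(11, 32, 84)    # [batch, channel, element]
    layer_f = nn.Linear(84, 1024)
    layer_b = nn.BatchNorm1d(1024)

    x = torch.relu(layer_f(x))     # -> [11, 32, 1024]
    x = layer_b(x)                 # RuntimeError: running_mean should contain 32 elements not 1024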

As you’ve already described, the number of channels in the input activation to the batchnorm layer doesn’t match the expected number of channels, so you would have to permute the activation via:

    x = x.permute(0, 2, 1)

before passing it to layer_b.
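Something like this sketch, reusing the layers from your snippet (permuting back at the end is an assumption about what the rest of your model expects):

    x = torch.relu(layer_f(x))   # [11, 32, 1024]
    x = x.permute(0, 2, 1)       # [11, 1024, 32]: out_features now on dim1 (channels)
    x = layer_b(x)               # BatchNorm1d(1024) now sees 1024 channels
    x = x.permute(0, 2, 1)       # back to [11, 32, 1024] if needed downstream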

If I do this, wouldn't things be normalized with respect to the channels rather than the features, or have I misunderstood something?

Permuting the output of the linear layer would assign the out_features dimension to dim1, which is the “channels” dimension in batchnorm layers.
I assumed this is the expected use case, since @Michael_Moran defined both the linear layer’s out_features and the batchnorm layer’s num_features as 1024. dim1 in the original output of the linear layer (size 32) is the “additional” dimension and is often used e.g. as the temporal dimension.

Whether this approach is correct depends on the use case, and it might be wrong for yours, so you would have to explain a bit more about what you are trying to achieve.
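To illustrate what the layer does with a 3D input, here is a small sketch (using affine=False so the learnable scale and shift don't obscure the comparison) showing that the statistics are computed per channel (dim1), reduced over the batch and the additional dimension:

    import torch
    import torch.nn as nn

    x = torch.randn(11, 1024, 32)   # [N, C, L]
    bn = nn.BatchNorm1d(1024, affine=False)
    out = bn(x)

    # Manual normalization: one mean/var per channel, reduced over dims 0 and 2
    mean = x.mean(dim=(0, 2), keepdim=True)
    var = x.var(dim=(0, 2), unbiased=False, keepdim=True)
    manual = (x - mean) / torch.sqrt(var + bn.eps)

    print(torch.allclose(out, manual, atol=1e-6))   # True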

In the end I got the desired behaviour with the following snippet:

    i, j, k = x.shape
    x = x.view(i * j, k)   # flatten batch and sequence dims into one large batch
    x = batch_layer(x)     # BatchNorm1d(k): features are now on dim1
    x = x.view(i, j, k)    # restore the original shape

This seems to have worked. The use case was that I have a minibatch of sets / unordered sequences, and I wanted the embedding of each token in these sets / sequences to be normalized in the same way with the same parameters.

This isn't an NLP task, but the language is appropriate, so I will use it.

I wanted to normalize each word token consistently with the same parameters. In the context of the normalization, the batch of sentences (where each sentence contains words) was abstracted down to just a large batch of words. I was unsure how BatchNorm1d would actually handle things using the permutation solution.

I apologise if I'm unclear, as I'm unsure how BatchNorm1d behaves with 3D tensors.
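For what it's worth, a quick sanity check (shapes taken from the printout above; the wiring of the permute solution is my assumption) suggests the flattening approach computes the same result as the permute approach, since both reduce over everything except the feature dimension:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(11, 32, 1024)   # [batch, words, features]
    bn = nn.BatchNorm1d(1024)

    # Flattening: treat every word in the batch as one sample
    i, j, k = x.shape
    out_view = bn(x.view(i * j, k)).view(i, j, k)

    # Permuting: put the feature dimension on dim1 (channels)
    out_perm = bn(x.permute(0, 2, 1)).permute(0, 2, 1)

    print(torch.allclose(out_view, out_perm, atol=1e-6))   # True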