How does the batch normalization work for sequence data?

pp18 · November 29, 2018, 4:36am

I have sequence data going into an RNN-type architecture with batch first, i.e. my input to the model has dimensions 64x256x16 (64 is the batch size, 256 the sequence length, and 16 the number of features) and the output is 64x256x1024 (again 64 is the batch size, 256 the sequence length, and 1024 the number of features). If I want to apply batch normalization, shouldn't it be over the 1024 output features? The problem is that PyTorch does not seem to allow this, or I do not completely understand it. The docs state the following:

`torch.nn.BatchNorm1d`: 
Parameters: num_features – C from an expected input of size (N,C,L) or
                           L from input of size (N,L)

For 2D data it is clear that batch normalization is executed on L for input of size (N, L), as N is the incoming features to the layer and L is the outgoing features, but it is confusing for 3D data, where I believe it should also be L.

Could someone who has used batch normalization on 3D data please explain?

Any help is very much appreciated.

Thank you for all the help.

ptrblck · November 29, 2018, 2:44pm

If you would like to use the feature dimension in batch norm, you could simply permute your input:

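import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1024)        # num_features = size of the feature dimension
x = torch.randn(64, 256, 1024)   # (batch, seq_len, features)
x = x.permute(0, 2, 1)           # (batch, features, seq_len)
output = bn(x)                   # normalized per feature, shape (64, 1024, 256)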

The BatchNorm1d layer will now have 1024 running estimates.
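
As a quick check of the permute approach, here is a minimal sketch that normalizes over the feature dimension and then permutes back to the batch-first layout an RNN expects. The variable names `out` and `per_feature_mean` are only illustrative; the shapes are the ones from the question.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1024)
x = torch.randn(64, 256, 1024)            # (batch, seq_len, features)

# Permute to (batch, features, seq_len), normalize, then permute back.
out = bn(x.permute(0, 2, 1)).permute(0, 2, 1)

print(out.shape)               # torch.Size([64, 256, 1024])
print(bn.running_mean.shape)   # torch.Size([1024])
print(bn.weight.shape)         # torch.Size([1024])

# In training mode each feature is normalized over batch and time together,
# so the per-feature mean of the output should be close to zero.
per_feature_mean = out.mean(dim=(0, 1))
print(per_feature_mean.abs().max())
```

Each of the 1024 features gets one mean and one variance, computed over all 64 x 256 (batch, time step) positions.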

pp18 · November 29, 2018, 4:15pm

@ptrblck thanks a lot. I feel so stupid for not thinking that way.

We can always permute the data to fit our needs, but I would still like to know the idea behind this default of normalizing over the middle dimension for 3D data.

ptrblck · November 29, 2018, 4:22pm

I guess it’s just the choice to provide a similar API for 1D, 2D and 3D cases, which all use dim1.
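
To illustrate the "dim1" convention, the sketch below runs the three batch norm layers on inputs of different rank and prints the shape of their running statistics. The channel count `C = 8` and the input sizes are arbitrary values chosen for the illustration.

```python
import torch
import torch.nn as nn

C = 8  # hypothetical channel/feature count, always placed in dim 1

cases = {
    "BatchNorm1d (N, C)": (nn.BatchNorm1d(C), torch.randn(4, C)),
    "BatchNorm1d (N, C, L)": (nn.BatchNorm1d(C), torch.randn(4, C, 10)),
    "BatchNorm2d (N, C, H, W)": (nn.BatchNorm2d(C), torch.randn(4, C, 5, 5)),
    "BatchNorm3d (N, C, D, H, W)": (nn.BatchNorm3d(C), torch.randn(4, C, 3, 5, 5)),
}

shapes = {}
for name, (layer, x) in cases.items():
    y = layer(x)
    # The output keeps the input shape; the statistics have one entry per channel.
    shapes[name] = (tuple(y.shape), tuple(layer.running_mean.shape))
    print(name, shapes[name])
```

In every case the layer keeps `C` running estimates and reduces over all other dimensions, which is why a batch-first sequence tensor has to be permuted so that the features sit in dim 1.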

Yasser · May 11, 2020, 2:18pm

Hello, I just came across this topic because I’m trying to do batch normalization for multivariate time series data, and I did it on the features following exactly the method you described:
bn = nn.BatchNorm1d(1024)
x = torch.randn(64, 256, 1024)
x = x.permute(0, 2, 1)
output = bn(x)

However, I applied this in the forward method of the LSTM class I created, so the normalized tensor is what I pass to the LSTM as input. The results are just perfect for me, but I still have one question:

Batch normalization is supposed to apply to each layer in the LSTM, but I have the feeling that this is not the case with what I did. I just added a few lines to the forward method, and it looks like they only apply to the input rather than to each layer.
Here’s the part I added in the forward method:

[Screenshot, 2020-05-11: the lines added to the forward method; image not preserved]

ptrblck · May 12, 2020, 3:07am

In your code snippet the batchnorm layer will be applied to features only before they are passed to self.lstm.
What is your use case? Would you like to apply batchnorm layers after each layer in a multi-layer LSTM?
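
The screenshot itself is not preserved, so the following is only a hypothetical module consistent with the description above: batch norm over the input features, followed by a multi-layer `nn.LSTM`. The class name `InputNormLSTM` and all sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class InputNormLSTM(nn.Module):
    """Hypothetical model: BatchNorm over input features, then a stacked LSTM."""

    def __init__(self, n_features=16, hidden_size=32, num_layers=2):
        super().__init__()
        self.bn = nn.BatchNorm1d(n_features)
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers, batch_first=True)

    def forward(self, x):                      # x: (batch, seq_len, features)
        # Normalize the input features only.
        x = self.bn(x.permute(0, 2, 1)).permute(0, 2, 1)
        # nn.LSTM runs all of its layers internally, so nothing is
        # normalized between the stacked layers.
        out, _ = self.lstm(x)
        return out

model = InputNormLSTM()
out = model(torch.randn(64, 256, 16))
print(out.shape)  # torch.Size([64, 256, 32])
```

In a setup like this there is a single batch norm layer in the whole model, and it only ever sees the raw input.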

Yasser · May 12, 2020, 10:25am

In my case I am working with multivariate time series data, and just like you said, I want to use batchnorm layers after each layer in the multi-layer LSTM.
I did some research on how to do this, and it seems that it is not possible with the standard LSTM.
The only way appears to be modifying the LSTM so that recurrent batchnorm (Article) can be applied; an implementation of the modified LSTM is given in this Github repo.
Can you confirm this, or is there a simpler way to do it?
Thank you !

ptrblck · May 12, 2020, 10:30pm

The repository looks alright, and I also wanted to suggest using LSTMCell instead.
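
As a rough sketch of the LSTMCell idea, the stack below unrolls the time loop by hand and applies a `BatchNorm1d` to each layer's hidden output before it feeds the next layer. The class name `StackedLSTMWithBN` and the sizes are hypothetical. Note that this is not the recurrent batch normalization from the linked article, which normalizes inside the cell; it only normalizes between layers.

```python
import torch
import torch.nn as nn

class StackedLSTMWithBN(nn.Module):
    """Hypothetical stack of LSTMCell layers with BatchNorm1d on each layer's output."""

    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.hidden_size = hidden_size
        sizes = [input_size] + [hidden_size] * num_layers
        self.cells = nn.ModuleList(
            [nn.LSTMCell(sizes[i], sizes[i + 1]) for i in range(num_layers)]
        )
        self.norms = nn.ModuleList(
            [nn.BatchNorm1d(hidden_size) for _ in range(num_layers)]
        )

    def forward(self, x):                          # x: (batch, seq_len, input_size)
        batch, seq_len, _ = x.shape
        h = [x.new_zeros(batch, self.hidden_size) for _ in self.cells]
        c = [x.new_zeros(batch, self.hidden_size) for _ in self.cells]
        outputs = []
        for t in range(seq_len):
            inp = x[:, t]                          # (batch, input_size)
            for i, (cell, norm) in enumerate(zip(self.cells, self.norms)):
                h[i], c[i] = cell(inp, (h[i], c[i]))
                inp = norm(h[i])                   # normalized output feeds the next layer
            outputs.append(inp)
        return torch.stack(outputs, dim=1)         # (batch, seq_len, hidden_size)

model = StackedLSTMWithBN(input_size=16, hidden_size=32, num_layers=2)
out = model(torch.randn(8, 20, 16))
print(out.shape)  # torch.Size([8, 20, 32])
```

The recurrent state `h[i]` passed to the next time step is left unnormalized here; whether to normalize it as well is a design choice this sketch does not settle.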

Yasser · May 13, 2020, 9:49am

Ok, Thank you very much

Maghoumi · June 19, 2020, 12:17am

If the output of the LSTM is a PackedSequence, what’s the correct way of applying BatchNorm1d?

ptrblck · June 19, 2020, 7:21am

I think you would have to apply the batchnorm layer on each input separately or pad the input sequences to the same length. A PackedSequence would contain inputs with different lengths, so the batchnorm layer wouldn’t be able to process this input together.

Maghoumi · June 19, 2020, 8:19am

or pad the input sequences to the same length

You mean use pad_packed_sequence on the PackedSequence, then apply the BatchNorm?
Something like this?

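from torch.nn.utils.rnn import pad_packed_sequence

# Assuming 'input' is a PackedSequence, and lstm and bn are already defined...
output, _ = lstm(input)
output, _ = pad_packed_sequence(output, batch_first=True)  # (batch, max_len, features)
# Note: padded time steps are included in the batch statistics here.
output = bn(output.permute(0, 2, 1)).permute(0, 2, 1)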

Maghoumi · June 19, 2020, 11:39pm

Hmm… Following this discussion, I think the correct way of applying BatchNorm is this (please correct me @ptrblck if I’m wrong):

from torch.nn.utils.rnn import PackedSequence

def simple_elementwise_apply(fn, packed_sequence):
    """Applies a pointwise function fn to each element in packed_sequence."""
    # Pass the index tensors through so the original batch order is preserved.
    return PackedSequence(fn(packed_sequence.data), packed_sequence.batch_sizes,
                          packed_sequence.sorted_indices, packed_sequence.unsorted_indices)

# Assuming 'input' is a PackedSequence and bn = BatchNorm1d(...)
output, _ = lstm(input)
output = simple_elementwise_apply(bn, output)
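
For a self-contained version of the same idea, here is a small sketch with made-up sizes and lengths (`lengths = [5, 3, 2]` is an assumption for illustration). It shows that the packed `data` tensor holds only the real time steps, so the batch statistics never see padding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import PackedSequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=6, batch_first=True)
bn = nn.BatchNorm1d(6)

x = torch.randn(3, 5, 4)                     # padded batch: (batch, max_len, features)
lengths = torch.tensor([5, 3, 2])            # hypothetical true lengths, sorted descending
packed = pack_padded_sequence(x, lengths, batch_first=True)

out, _ = lstm(packed)
# out.data has shape (sum(lengths), hidden_size): only real time steps, no padding.
normed = PackedSequence(bn(out.data), out.batch_sizes,
                        out.sorted_indices, out.unsorted_indices)

# Unpack only when a padded tensor is needed downstream.
padded, out_lengths = pad_packed_sequence(normed, batch_first=True)
print(out.data.shape)           # torch.Size([10, 6])
print(padded.shape)             # torch.Size([3, 5, 6])
print(normed.data.mean(dim=0))  # close to zero for every feature
```

Because `out.data` is a 2D tensor of shape (total time steps, features), `BatchNorm1d` can be applied directly without any permute.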

ptrblck · June 20, 2020, 7:43am

That looks correct, yes. (I also always trust @tom’s solutions.)