I want to understand the exact implementation inside BatchNorm2d, i.e. the exact equation computed by net.bn1(tem1). By “equivalent” I mean that both should produce the same output.
I think running_mean and running_var are computed during training, but are used for normalization only during testing. During training, the batch statistics (mean and std) are used directly for normalization. @ptrblck Is that right?
As far as BatchNorm2d computation is concerned, the following works:
bn_in = ...  # input tensor to the batch norm layer, shape (N, C, H, W)
num_channels = bn_in.shape[1]  # same as out_channels of the preceding conv
# flatten the spatial dimensions
bn_in_flat = bn_in.view(bn_in.shape[0], num_channels, -1)
# we need mean and std per channel across the whole batch, so move the
# channel dimension to the front and flatten the rest
bn_in_flat = bn_in_flat.permute(1, 0, 2).reshape(num_channels, -1)
# per-channel mean
bn_in_mean = bn_in_flat.mean(dim=-1).view(1, num_channels, 1, 1)
# per-channel biased std (BatchNorm normalizes with the biased variance)
bn_in_std = bn_in_flat.std(dim=-1, unbiased=False).view(1, num_channels, 1, 1)
# BatchNorm2d computation
bn_out = (net.bn1.weight.view(1, num_channels, 1, 1) *
          (bn_in - bn_in_mean) / (bn_in_std**2 + net.bn1.eps).sqrt() +
          net.bn1.bias.view(1, num_channels, 1, 1))
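You can verify the equivalence with a self-contained check against nn.BatchNorm2d in training mode (the shapes, seed, and number of features here are just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_features = 3
bn = nn.BatchNorm2d(num_features)
bn.train()

x = torch.randn(4, num_features, 5, 5)

# reference output from the module (training mode -> batch statistics)
ref = bn(x)

# manual computation: per-channel mean and biased variance over (N, H, W)
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = ((x - mean) / (var + bn.eps).sqrt()
          * bn.weight.view(1, -1, 1, 1) + bn.bias.view(1, -1, 1, 1))

print(torch.allclose(ref, manual, atol=1e-6))  # True
```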
Yes, you are correct. The running estimates are used during eval. If track_running_stats is set to False, the batch statistics will be used during training and eval.
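A small sketch of the eval-mode behavior (toy shapes, seed just for reproducibility): after a forward pass in training mode has updated the running estimates, the eval-mode output can be reproduced from running_mean and running_var directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
x = torch.randn(4, 3, 5, 5)

bn.train()
bn(x)  # forward pass updates running_mean / running_var

bn.eval()
out = bn(x)

# in eval mode the stored running estimates are used, not the batch stats
manual = ((x - bn.running_mean.view(1, -1, 1, 1))
          / (bn.running_var.view(1, -1, 1, 1) + bn.eps).sqrt()
          * bn.weight.view(1, -1, 1, 1) + bn.bias.view(1, -1, 1, 1))
print(torch.allclose(out, manual, atol=1e-6))  # True
```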
I am wondering why running_mean and running_var are not used during training. Intuitively, that would make the model more stable, and the convergence would be faster.
These running stats are updated during training and are initialized to default values of zeros for running_mean and ones for running_var. Using them for normalization at the start of training, without having updated them first, therefore wouldn’t normalize the data at all.
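You can see the default initialization directly on a freshly created layer:

```python
import torch
import torch.nn as nn

# a brand-new BatchNorm2d layer: running stats start at the identity transform
bn = nn.BatchNorm2d(4)
print(bn.running_mean)  # tensor([0., 0., 0., 0.])
print(bn.running_var)   # tensor([1., 1., 1., 1.])
```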
Also, even if you update the stats during training and use them for normalization at the same time, I would assume your model might perform quite badly, since the true mean and var can differ considerably from the default values, so the model would initially be trained on “unnormalized” data.
Batchnorm layers deeper in the model would then also receive activations that were not created from normalized data, so your complete model could break.
EDIT: just a quick update:
As you can see, there are a lot of “could/should”s in my post, so don’t let it stop you from experimenting with this approach if you think it can work. Also, please update me once you’ve run some experiments.
After some research, I found that this approach (in a slightly modified form) is used in batch renormalization. It is not included in PyTorch yet, but some third-party repos like this one are available.
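For reference, the batch renorm correction can be sketched roughly like this. This is a hypothetical minimal version based on the paper (Ioffe, 2017), not the linked repo’s API; the function name and the rmax/dmax values are illustrative, and the affine scale/shift and running-stat updates are omitted:

```python
import torch

def batch_renorm_sketch(x, running_mean, running_var, eps=1e-5,
                        rmax=3.0, dmax=5.0):
    # normalize with batch stats, then correct toward the running stats
    # via the clipped factors r and d, so training and inference see
    # more consistent statistics
    batch_mean = x.mean(dim=(0, 2, 3))
    batch_std = x.var(dim=(0, 2, 3), unbiased=False).add(eps).sqrt()
    running_std = (running_var + eps).sqrt()
    r = (batch_std / running_std).clamp(1.0 / rmax, rmax).detach()
    d = ((batch_mean - running_mean) / running_std).clamp(-dmax, dmax).detach()
    x_hat = (x - batch_mean.view(1, -1, 1, 1)) / batch_std.view(1, -1, 1, 1)
    return x_hat * r.view(1, -1, 1, 1) + d.view(1, -1, 1, 1)

torch.manual_seed(0)
x = torch.randn(4, 3, 5, 5)
out = batch_renorm_sketch(x, torch.zeros(3), torch.ones(3))
print(out.shape)  # torch.Size([4, 3, 5, 5])
```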
The running stats are not storing a history of the values, but are updated to a single value as described in the docs using the current batch stats, the running stats, and the momentum.
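The update rule from the docs can be checked numerically. Note that the running_var update uses the unbiased batch variance, while the normalization itself uses the biased one (shapes and seed here are just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3, momentum=0.1)
x = torch.randn(4, 3, 5, 5)

old_mean = bn.running_mean.clone()
old_var = bn.running_var.clone()

bn.train()
bn(x)  # one forward pass updates the running stats

batch_mean = x.mean(dim=(0, 2, 3))
# the running_var update uses the *unbiased* batch variance
batch_var = x.var(dim=(0, 2, 3), unbiased=True)

expected_mean = (1 - bn.momentum) * old_mean + bn.momentum * batch_mean
expected_var = (1 - bn.momentum) * old_var + bn.momentum * batch_var

print(torch.allclose(bn.running_mean, expected_mean, atol=1e-6))  # True
print(torch.allclose(bn.running_var, expected_var, atol=1e-6))   # True
```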
All of these values will be used, since the running_mean and running_var contain num_features values, which corresponds to the number of input channels.
OK, I think I got it: during training only the current per-channel batch mean/std are used (plus the learnable scale and shift factors), while at test time the running estimates (also per-channel) are used.