To verify identical behaviour with the nn.BatchNorm equivalent, I initialize two models (and two optimizers), one using MyBatchNorm and one using nn.BatchNorm. Although I can verify that “compareBN” initially returns identical results (Max diff: tensor(8.5831e-06, device='cuda:0', grad_fn=<MaxBackward1>)), very tiny differences begin to emerge as training progresses.
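For reference, the comparison helper can be sketched like this (compare_bn and the layer sizes are illustrative, not the exact code from the thread):

```python
import torch
import torch.nn as nn

def compare_bn(bn1, bn2):
    # Largest absolute difference between the running statistics and
    # affine parameters of two BatchNorm layers.
    diffs = torch.stack([
        (bn1.running_mean - bn2.running_mean).abs().max(),
        (bn1.running_var - bn2.running_var).abs().max(),
        (bn1.weight - bn2.weight).abs().max(),
        (bn1.bias - bn2.bias).abs().max(),
    ])
    return diffs.max()

# Two freshly initialized layers match exactly; differences only appear
# once training updates them along different code paths.
bn_a, bn_b = nn.BatchNorm1d(10), nn.BatchNorm1d(10)
print('Max diff: ', compare_bn(bn_a, bn_b))
```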

By the end of the 25 epochs, both the models and the results differ significantly. Note that I have set all random seeds to 0 (including deterministic behaviour for cudnn and cuda).

This reminded me of fp16 vs fp32 training. Can it be the case that the underlying implementation of nn.BatchNorm uses a different precision? Or is it something else? The MyBatchNorm model is also slightly slower (+2 seconds on a 10-second epoch), but I’d assume this is due to nn.BatchNorm being implemented in C++. I used a 2-layer FCNN with 2 BatchNorm layers on MNIST. Interestingly, the 2nd BatchNorm layer is slower to show discrepancies.

I was able to get the models (predictions as well as the batchnorm params) closer to each other by using double precision (model.double()). In fact, their differences have now stabilised around 10^-10 - 10^-11 (based on the output of print('Max diff: ', (outputs - outputs_).abs().max()) and compareBN), whereas before they would at some point escalate to 0.5-1.5 and therefore end up as 2 completely different models.

The downside is that it now takes longer to train due to the double precision. If anyone has any recommendation on how to mitigate the problem without the overhead, I am all ears.

My manual implementation was originally intended as a reference implementation to show how the running estimates are updated and how the batchnorm layer works in general.
The performance difference is expected, as I used native methods without specific (and faster) kernels.
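Such a reference implementation can be sketched as follows (a minimal sketch using native ops; MyBatchNorm1d is an illustrative name, and the momentum/eps handling follows nn.BatchNorm1d):

```python
import torch
import torch.nn as nn

class MyBatchNorm1d(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=0)
            # The biased variance is used for normalization ...
            var = x.var(dim=0, unbiased=False)
            n = x.size(0)
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum).add_(mean, alpha=self.momentum)
                # ... while the running estimate uses the unbiased variance.
                self.running_var.mul_(1 - self.momentum).add_(var * n / (n - 1), alpha=self.momentum)
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.weight + self.bias
```

In training mode this matches nn.BatchNorm1d up to floating point rounding, which is exactly where the small differences discussed here come from.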

The difference after a couple of iterations is most likely caused by the accumulation of rounding errors due to the limited floating point precision.
The reduction in these errors when using float64 is another indication of this.

What’s your use case, that you would like to use this implementation in a real model?

Could you be more specific with what you mean with faster kernels :)? I’d be interested to look into that. I am trying to implement ghost batch normalization (in essence, estimate batch statistics based on smaller sample sizes than the original mini-batch) for research purposes.

Other than the speedup one would gain from adopting those functions, do they also facilitate determinism? When I don’t use cuda(), I can reproduce the experiments; however, when I use cuda() and my own implementation of BatchNorm, the results are not reproducible.

Have a look at the Reproducibility docs, if not already done.
What exactly did you change in the implementation? I’m able to get deterministic outputs just by running my code snippet.
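For completeness, these are the relevant settings from the Reproducibility docs (seeding alone is not enough if a nondeterministic CUDA kernel is involved):

```python
import torch

# Seed all RNGs and force deterministic cuDNN algorithms.
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# With identical seeds, sampling is reproducible:
torch.manual_seed(0)
a = torch.randn(3)
torch.manual_seed(0)
b = torch.randn(3)
```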

You are right, I modified the implementation back to yours, and it’s reproducible. It’s very strange because the nondeterminism seems to arise from rounding errors that do not happen in the same way, i.e. on 2 different runs I get:
Epoch 1/2, training loss: 0.2351639289064084, validation loss: 0.09888318485801263
Epoch 1/2, training loss: 0.23525172622356727, validation loss: 0.10040923198367216

I’ve added the following: reshape input tensor, take the mean & var across the newly introduced dimension, and then use repeat_interleave on mean & var. Finally, do the (input - mean) / sqrt(var) calculation.
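The steps above can be sketched as follows (training-mode statistics only; ghost_bn and the split count are illustrative names, and the repeat_interleave calls are the part whose CUDA backward turns out to be nondeterministic):

```python
import torch

def ghost_bn(x, num_splits, eps=1e-5):
    # Reshape (N, C) into (num_splits, N // num_splits, C), compute the
    # statistics per ghost batch, repeat them back to the full batch
    # size, and normalize.
    n, c = x.shape
    xg = x.view(num_splits, n // num_splits, c)
    mean = xg.mean(dim=1)                                  # (num_splits, C)
    var = xg.var(dim=1, unbiased=False)                    # (num_splits, C)
    mean = mean.repeat_interleave(n // num_splits, dim=0)  # (N, C)
    var = var.repeat_interleave(n // num_splits, dim=0)    # (N, C)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(8, 3)
out = ghost_bn(x, num_splits=2)
```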

Ah OK.
The backward pass of repeat_interleave is not deterministic as explained in the linked docs:

Additionally, the backward path for repeat_interleave() operates nondeterministically on the CUDA backend because repeat_interleave() is implemented using index_select(), the backward path for which is implemented using index_add_(), which is known to operate nondeterministically (in the forward direction) on the CUDA backend (see above).
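One possible way to sidestep this (my own assumption, not something stated in the docs): keep the ghost-batch dimension and let broadcasting do the expansion, so neither index_select() nor index_add_() appears in the backward pass:

```python
import torch

def ghost_bn_broadcast(x, num_splits, eps=1e-5):
    # Same computation as the repeat_interleave variant, but the
    # per-split statistics keep their split dimension and are broadcast
    # against the reshaped input instead of being materialized per sample.
    n, c = x.shape
    xg = x.view(num_splits, n // num_splits, c)
    mean = xg.mean(dim=1, keepdim=True)
    var = xg.var(dim=1, unbiased=False, keepdim=True)
    return ((xg - mean) / torch.sqrt(var + eps)).view(n, c)
```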

I’m also trying to reimplement BatchNorm2d based on your approach and am getting similar rounding errors. Do you have any recommendations for a way to completely remove these differences?

My use case is that I’m trying to separate the BatchNorm layer into two separate parts - mean/var (“head”) and weight/bias (“tail”). Ideally, I would like to have each part as a separate class that extends nn.Module, so that self.tail(self.head(x)) is exactly the same as doing self.batch_norm(x). You could also think of the “head” as doing nn.BatchNorm2d(num_features, affine=False), but I just need a way to do the affine transformation without any precision errors compared to the normal method.
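A sketch of that split (BNHead/BNTail are hypothetical names; note that even this composition may not be bit-identical to nn.BatchNorm2d, since the fused kernel can order the floating point operations differently):

```python
import torch
import torch.nn as nn

class BNHead(nn.Module):
    # Normalization only: wraps nn.BatchNorm2d with affine=False so the
    # running statistics are tracked exactly as in the fused layer.
    def __init__(self, num_features):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)

    def forward(self, x):
        return self.bn(x)

class BNTail(nn.Module):
    # Per-channel affine transform: y = x * weight + bias.
    def __init__(self, num_features):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        return x * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```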

If you are seeing absolute errors in the range of 1e-5, they would be created by the limited floating point precision and the order of operations.
I don’t think you can easily guarantee the order of floating point operations in the manual approach to match the rounding errors in the internal PyTorch version.
If you need more precision, you could use .double() types for the data, buffers, and parameters.
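In code, that conversion is a one-liner; the inputs have to be cast to float64 as well (the small model here is just for illustration):

```python
import torch
import torch.nn as nn

# .double() converts all parameters and buffers (including the BatchNorm
# running statistics) to float64; inputs must be cast to match.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4)).double()
x = torch.randn(8, 4, dtype=torch.float64)
out = model(x)
```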