Modifying BatchNorm source code


I want to modify the PyTorch C/C++ source code for the Batch (and Group, Layer, etc.) Norm layers as part of my research, which could hopefully lead to a contribution to PyTorch if the work is successful and substantial.

Which files should I look at? The hope is that I can write something like nn.BatchNorm2d(num_features, my_extra_parameter=True) and have the extra parameter passed down to where the normalization and (optional) affine transformation are ultimately computed. If my_extra_parameter == False, it should behave like the batch norm already implemented in PyTorch; if True, it should behave differently, following my modifications. I want this to work on both CPU and GPU (I mainly care about GPU, since this would be used in training). What I’m trying to do involves built-in PyTorch functions such as abs, sign, etc., which I would think makes modifying the forward and backward functions of batch norm relatively easy. I just don’t know where to look.

Thank you!


You can find the CUDA kernels here, but they might not be straightforward to understand.
Here are the CPU implementations, in case you also want to take a look at them.

I would personally recommend writing your experiments as a custom (Python) module first and verifying its usability in a few experiments. A manual batchnorm implementation can be found here.
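To make the suggestion concrete, here is a minimal sketch of what such a custom Python module could look like. It is not the linked implementation; the class name ManualBatchNorm2d is made up for illustration, and my_extra_parameter is only the hypothetical flag from the question, left as a placeholder because the thread does not describe the actual modification. With the flag off, it is meant to mirror nn.BatchNorm2d with default settings.

```python
import torch
import torch.nn as nn


class ManualBatchNorm2d(nn.Module):
    """Per-channel batch norm over (N, H, W), written with plain tensor ops."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1, my_extra_parameter=False):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        # Hypothetical flag taken from the question; not a real PyTorch argument.
        self.my_extra_parameter = my_extra_parameter
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=(0, 2, 3))
            # The biased variance is used to normalize the current batch.
            var = x.var(dim=(0, 2, 3), unbiased=False)
            n = x.numel() / x.size(1)  # elements per channel
            with torch.no_grad():
                # The running variance is updated with the unbiased estimate.
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var * n / (n - 1))
        else:
            mean, var = self.running_mean, self.running_var

        if self.my_extra_parameter:
            # Placeholder: the modified normalization would go here.
            raise NotImplementedError("custom normalization is not specified in this sketch")

        x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.eps)
        return x_hat * self.weight[None, :, None, None] + self.bias[None, :, None, None]
```

Because everything is built from differentiable tensor ops, autograd provides the backward pass, and comparing the output against nn.BatchNorm2d with torch.allclose is a quick sanity check before adding the custom branch.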


Thank you for the response! I currently have it working as a custom Python module, and some initial results show potential, so we want to explore further. My advisor wants me to modify the C++ source code so that whatever we present in a paper is exactly the behavior one would expect if our method were implemented in PyTorch and used by someone else.

I’m curious: if I were to rewrite the standard PyTorch batch norm layer as a PyTorch C++ extension, should I be able to exactly match the performance of PyTorch’s version? If so, it seems like it would be easier to implement it that way, and it would still get us close enough to what my advisor wants.

Yes, I think using a custom C++/CUDA extension should not introduce a performance penalty compared to built-in methods.
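If you want to check this for your own layer, a small harness along these lines could help. This is only a sketch: compare_modules is a hypothetical helper name, it times a single forward pass in eval mode, and the candidate can be any nn.Module (for example, a wrapper around your extension). It relies on torch.utils.benchmark.Timer for the timing.

```python
import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark


def compare_modules(candidate, reference, x, atol=1e-5, runs=20):
    """Return (outputs_match, candidate_seconds, reference_seconds) for a forward pass."""
    candidate.eval()
    reference.eval()
    with torch.no_grad():
        # Numerical agreement first; timing is meaningless if the outputs differ.
        outputs_match = torch.allclose(candidate(x), reference(x), atol=atol)
        timings = []
        for module in (candidate, reference):
            timer = benchmark.Timer(stmt="module(x)", globals={"module": module, "x": x})
            timings.append(timer.timeit(runs).mean)  # mean seconds per call
    return outputs_match, timings[0], timings[1]


if __name__ == "__main__":
    x = torch.randn(32, 64, 56, 56)
    # Stand-in candidate: replace the first argument with your own module.
    print(compare_modules(nn.BatchNorm2d(64), nn.BatchNorm2d(64), x))
```

For training you would also want to time the backward pass and run the comparison on the GPU with realistic input shapes.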