Trying to understand AffineChannel operation

I’m trying to reproduce RetinaNet in PyTorch by directly porting the original Caffe2 implementation from Detectron. One issue I’ve run into is that Detectron uses an operation called AffineChannel in place of batch normalization, because of the small batch sizes one encounters when training object detection models.


CUDA kernel for AffineChannel:

Am I correct in my understanding that AffineChannel simply multiplies each channel by its own learnable scale parameter and adds a learnable bias parameter?
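For reference, the operation as described in this question could be sketched in PyTorch roughly as follows. This is only an illustration of "per-channel scale plus per-channel bias", not a port of the Detectron kernel; the class name `AffineChannel2d` and the identity initialization are assumptions made for the example.

```python
import torch
import torch.nn as nn


class AffineChannel2d(nn.Module):
    """Hypothetical module: y[n, c, h, w] = x[n, c, h, w] * scale[c] + bias[c]."""

    def __init__(self, num_channels):
        super().__init__()
        # Assumed initialization: identity transform (scale = 1, bias = 0).
        self.scale = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x):
        # Reshape to (1, C, 1, 1) so both parameters broadcast over N, H and W.
        return x * self.scale.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)


layer = AffineChannel2d(3)
x = torch.randn(2, 3, 4, 4)
y = layer(x)  # same shape as x; no batch statistics are computed anywhere
```

Note that nothing here depends on the batch, so the output for one image is the same regardless of what else is in the batch.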

Is there any evidence that this helps learning? I haven’t encountered this approach before.

Based on your description and skimming through the code, it seems you are right in your assumption.
Wouldn’t the operation thus correspond to an nn.BatchNorm2d layer with track_running_stats=False? This might make sense if the batch size is small, as the running estimates will most likely be quite noisy.

Yes I think you might be right. I’ve never used track_running_stats before and I see it’s described as:

track_running_stats: a boolean value that when set to `True`, this
            module tracks the running mean and variance, and when set to `False`,
            this module does not track such statistics and always uses batch
            statistics in both training and eval modes. Default: `True`

It sounds like, while it doesn’t track a running mean/variance, it still normalizes with the batch mean and variance? In my case I think I would want a fixed mean of 0 and a variance of 1, but still allow gamma and beta to be learnable parameters, correct?
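A quick way to check this reading of the docstring is to feed a layer created with track_running_stats=False some input that is far from zero mean and unit variance, in eval mode, and look at the per-channel statistics of the output. The tensor shapes and the seed below are arbitrary choices for the sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(3, track_running_stats=False)
bn.eval()  # even in eval mode there are no running stats to fall back on

# Input deliberately shifted and scaled away from mean 0 / variance 1.
x = torch.randn(8, 3, 4, 4) * 5 + 10
y = bn(x)

# Per-channel statistics over the batch and spatial dimensions.
out_mean = y.mean(dim=(0, 2, 3))
out_var = y.var(dim=(0, 2, 3), unbiased=False)
print(out_mean, out_var)
```

If the output comes back with roughly zero mean and unit variance per channel, the layer is still normalizing with batch statistics, which is not what AffineChannel does.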

Yeah, you are right, sorry for the mistake.
It would rather correspond to track_running_stats=True with the layer set to evaluation mode (bn.eval()), which would then use the initial running stats (mean 0, variance 1).

That sounds like it will freeze the statistics (e.g. mean and variance), but it also sounds like it will freeze gamma and beta, which appear to be learnable in AffineChannel.

I think I might be able to mimic the AffineChannel operation by creating vectors torch.ones([num_channels]) and torch.zeros([num_channels]) and using these as the initial gamma and beta respectively. I’ll give it a shot and see how it goes.

bn.eval() will not freeze the trainable parameters; it just makes the layer use the running statistics, which would still have their initial values.
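A small check of this, as a sketch with arbitrary shapes: put a default nn.BatchNorm2d in eval mode, compare its output to a plain per-channel affine transform, and confirm that weight and bias still receive gradients. One caveat is that the layer divides by sqrt(running_var + eps), so with the initial variance of 1 the output is scaled by 1 / sqrt(1 + eps) rather than exactly 1; the learnable weight can absorb that factor.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)  # track_running_stats=True by default
bn.eval()               # use running_mean = 0 and running_var = 1 (initial values)

x = torch.randn(2, 3, 4, 4) * 5 + 10
y = bn(x)

# What a pure per-channel affine transform would give, including the eps factor.
gamma = bn.weight.view(1, -1, 1, 1)
beta = bn.bias.view(1, -1, 1, 1)
expected = x / (1 + bn.eps) ** 0.5 * gamma + beta
matches_affine = torch.allclose(y, expected, atol=1e-5)

# gamma and beta are still trainable: gradients flow to both.
y.sum().backward()
weight_has_grad = bn.weight.grad is not None
bias_has_grad = bn.bias.grad is not None

# The running statistics are not updated in eval mode.
stats_unchanged = (
    torch.equal(bn.running_mean, torch.zeros(3))
    and torch.equal(bn.running_var, torch.ones(3))
)
print(matches_affine, weight_has_grad, bias_has_grad, stats_unchanged)
```

Keep in mind that calling model.train() on a parent module switches the layer back to training mode, so the bn.eval() call would need to be repeated after that.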
