Hello, I want to apply an affine operation to a 4D tensor, say X with shape (B, C, H, W). The weight's shape is (1, C, 1, 1). The reason for doing this is that I want to remove normalization but keep the affine layer. This is essentially just part of BN's operation, so it should not be slow. However, without these affine layers I can train ImageNet on an 8-GPU machine in 9.6 hours. With the affine layers added, the time becomes 12.7 hours, which is even slower than the BN models.
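For context, here is roughly what I mean by the affine layer (a minimal PyTorch sketch; the module name `Affine` and the initialization to weight=1, bias=0 are just illustrative):

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Per-channel scale and shift, i.e. BN's affine part without normalization."""
    def __init__(self, num_channels):
        super().__init__()
        # Shaped (1, C, 1, 1) so they broadcast over (B, C, H, W)
        self.weight = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        # Elementwise multiply-add; broadcasting expands over B, H, W
        return x * self.weight + self.bias

x = torch.randn(8, 64, 32, 32)
y = Affine(64)(x)
print(y.shape)  # torch.Size([8, 64, 32, 32])
```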

It surprises me that an operation that adds almost no FLOPs takes so much time. I am wondering whether there is any optimization I can do to accelerate it. Thanks!