Hello, I want to apply an affine operation to a 4D tensor, say X with shape (B, C, H, W). The weight's shape is (1, C, 1, 1). The reason for doing this is that I want to remove the normalization but keep the affine layer. This is essentially part of BN's computation, so it should not be slow. However, without these affine layers I can train ImageNet on an 8-GPU machine in 9.6 hours; with the affine layers added it takes 12.7 hours, which is even slower than the BN models.
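
For reference, the layer I am using looks roughly like this (a minimal sketch; the module name `AffineChannel2d` and the initialization are my own choices, not a built-in PyTorch module):

```python
import torch
import torch.nn as nn

class AffineChannel2d(nn.Module):
    """Per-channel scale and shift, i.e. BN's affine part without the normalization."""
    def __init__(self, num_channels):
        super().__init__()
        # Same broadcastable shape as described above: (1, C, 1, 1)
        self.weight = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        # x: (B, C, H, W); broadcasting applies the scale and shift per channel
        return x * self.weight + self.bias
```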

It surprised me that an operation that contributes so few FLOPs takes so much time. I am wondering whether there is any optimization I can do to speed it up? Thanks!

That will just do one multiplication; I don't expect any performance issue with it (any more than with any other operation that runs over the whole Tensor).

Shouldn't this operation run faster than the entire BN operation? However, the measured run time shows this multiplication is actually slower than the full BN operation.

Yes, this is what I find weird. I observed it while training ResNet-50 on ImageNet, recording the time with BN and with the affine layer only. The results are not consistent with the CPU comparison: on 8 V100 GPUs, the models with BN run faster than the models with only the affine layers. I tried both DataParallel and DDP modes.
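
To isolate the two layers from the rest of the training pipeline, a standalone comparison along these lines might help (a sketch; the input shape, iteration counts, and use of CUDA events are my assumptions, and `AffineChannel2d` is the sketch from the first post, not a library module):

```python
import torch
import torch.nn as nn

class AffineChannel2d(nn.Module):  # same sketch as in the first post
    def __init__(self, c):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1, c, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, c, 1, 1))
    def forward(self, x):
        return x * self.weight + self.bias

def time_layer(layer, x, iters=100):
    """Rough GPU timing of one layer (forward + backward) using CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    # Warm-up so allocator/cuDNN setup does not dominate the measurement
    for _ in range(10):
        layer(x).sum().backward()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        layer(x).sum().backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration

device = "cuda"
# A ResNet-50-ish activation shape, chosen for illustration
x = torch.randn(64, 256, 56, 56, device=device, requires_grad=True)

bn = nn.BatchNorm2d(256).to(device)
affine = AffineChannel2d(256).to(device)

print("BN     :", time_layer(bn, x), "ms/iter")
print("Affine :", time_layer(affine, x), "ms/iter")
```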