How to apply affine layers fast?

KaiHoo · September 15, 2020, 2:35am

Hello, I want to apply an affine operation to a 4D tensor, say X with shape (B,C,H,W). The weight’s shape is (1,C,1,1). The reason for doing this is that I want to remove the normalization but keep the affine layer. Basically, it is part of what BN does, so it should not be slow. However, without these affine layers I can train ImageNet on an 8-GPU machine in 9.6 hours. With the affine layers added, training takes 12.7 hours, which is even slower than the BN models.

It surprised me that an operation that does not count toward FLOPs takes so much time. Is there any optimization I can do to accelerate it? Thanks!
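One possible way to avoid the extra pass over the activations is to fold the scale into the weights of the preceding convolution. This is only a sketch: it assumes each affine layer directly follows a Conv2d with default zero padding, which the post does not state, and the class name ScaledConv2d is hypothetical. Because convolution is linear in its weights, scaling output channel c of the result is the same as scaling filter c of the kernel, and the kernel is much smaller than the (B,C,H,W) activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledConv2d(nn.Conv2d):
    """Conv2d with a learnable per-output-channel scale folded into the kernel.

    Hypothetical helper, assuming the affine layer directly follows a
    convolution that uses the default padding_mode='zeros'.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One scale per output channel; shape (C_out, 1, 1, 1) broadcasts
        # over the kernel of shape (C_out, C_in / groups, kH, kW).
        self.scale = nn.Parameter(torch.ones(self.out_channels, 1, 1, 1))

    def forward(self, inp: torch.Tensor) -> torch.Tensor:
        # Scale the small weight tensor instead of the large activation.
        weight = self.weight * self.scale
        bias = None if self.bias is None else self.bias * self.scale.flatten()
        return F.conv2d(inp, weight, bias, self.stride, self.padding,
                        self.dilation, self.groups)


if __name__ == "__main__":
    conv = ScaledConv2d(3, 8, kernel_size=3, padding=1, bias=False)
    x = torch.rand(2, 3, 16, 16)
    print(conv(x).shape)
```

The module computes the same function as a convolution followed by inp * weight, so the gradients with respect to the kernel and the scale are unchanged; whether it is actually faster in a given network would need to be measured.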

albanD · September 15, 2020, 2:11pm

Hi,

How do you implement this transformation? Do you just do inp * weight + bias?

KaiHoo · September 15, 2020, 4:00pm

Yes, I just do inp * weight, where inp’s shape is (B,C,H,W) and weight’s shape is (1,C,1,1). (I do not add a bias in the network.)
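For reference, a layer of this kind could look roughly like the following. This is a minimal sketch rather than the code used in the experiments; the class name ChannelScale and the initialization to ones are assumptions.

```python
import torch
import torch.nn as nn


class ChannelScale(nn.Module):
    """Learnable per-channel scale: BN's affine weight without normalization or bias."""

    def __init__(self, num_channels: int):
        super().__init__()
        # Shape (1, C, 1, 1) broadcasts over batch, height and width.
        # Initializing to ones is an assumption (it mirrors BatchNorm2d's default).
        self.weight = nn.Parameter(torch.ones(1, num_channels, 1, 1))

    def forward(self, inp: torch.Tensor) -> torch.Tensor:
        # Element-wise multiply; every activation is read once and written once.
        return inp * self.weight


if __name__ == "__main__":
    layer = ChannelScale(3)
    x = torch.rand(2, 3, 50, 50)
    print(layer(x).shape)
```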

albanD · September 15, 2020, 4:18pm

That will just do one multiplication. I don’t expect any performance issue with this (any more than with any other operation that runs over the whole Tensor).

KaiHoo · September 16, 2020, 7:05am

I think this operation should run faster than the entire BN operation. However, the actual run time shows that this multiplication is slower than the entire BN operation.

albanD · September 16, 2020, 2:46pm

How do you check that? I don’t see that when running on CPU:

In [1]: import torch

In [2]: a = torch.rand(2, 3, 50, 50)

In [3]: b = torch.nn.BatchNorm2d(3)

In [4]: scaling = torch.rand(1, 3, 1, 1)

In [5]: %timeit b(a)
115 µs ± 652 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [6]: %timeit a * scaling
5.15 µs ± 661 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

KaiHoo · September 17, 2020, 2:36am

Yes, this is what I find weird. I found it by training ResNet50 on ImageNet. I recorded the training time with BN and with only the affine layer. But the results are not consistent with the CPU comparison: models with BN run faster than models with only affine layers on 8 V100 GPUs. I tried both DataParallel and DDP modes.
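To compare the two operations in isolation on the GPU, something like the sketch below could be used. It times forward plus backward, since training pays for both, and calls torch.cuda.synchronize() because CUDA kernels launch asynchronously. The helper name time_fwd_bwd and the tensor sizes are assumptions chosen for illustration, and the script falls back to CPU when no GPU is available.

```python
import time

import torch


def time_fwd_bwd(module, inp, iters=20, warmup=5):
    """Return the average seconds per forward+backward pass of `module` on `inp`."""
    use_cuda = inp.is_cuda
    for i in range(warmup + iters):
        if i == warmup:
            # Start the clock only after warm-up and after pending kernels finish.
            if use_cuda:
                torch.cuda.synchronize()
            start = time.perf_counter()
        out = module(inp)
        out.sum().backward()
    if use_cuda:
        # Wait for all queued kernels before reading the clock.
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Sizes are illustrative, not taken from the ResNet50 runs above.
    x = torch.rand(8, 64, 56, 56, device=device, requires_grad=True)
    bn = torch.nn.BatchNorm2d(64).to(device)
    scale = torch.nn.Parameter(torch.rand(1, 64, 1, 1, device=device))

    t_bn = time_fwd_bwd(bn, x)
    t_scale = time_fwd_bwd(lambda t: t * scale, x)
    print(f"BatchNorm2d: {t_bn * 1e6:.1f} us, scale only: {t_scale * 1e6:.1f} us")
```

A micro-benchmark like this only measures the layer itself. End-to-end training time also depends on data loading, communication between GPUs and the rest of the model, so it may not explain a difference seen in a full run.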

KaiHoo · September 17, 2020, 3:09am

Hello, I found out the reason why my network is slower. Sorry for this post!