Hello, I want to apply an affine operation to a 4D tensor, say X with a shape of (B,C,H,W). The weight’s shape is (1,C,1,1). The reason for doing this is that I want to remove normalization but keep the affine layer. This is just one part of what BN does, so it should not be slow. However, without these affine layers I can train ImageNet on an 8-GPU machine in 9.6 hours. With the affine layers added, training takes 12.7 hours, which is even slower than the BN models.
It surprised me that an operation that barely counts toward FLOPs takes so much time. Is there any optimization I can do to speed it up? Thanks!
How do you implement this transformation? Do you just do
inp * weight + bias ?
Yes, I just do inp * weight, where inp’s shape is (B,C,H,W) and weight’s shape is (1,C,1,1). (I do not add a bias in the network.)
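For concreteness, a scale-only layer like the one described here could be written as a small module. This is only a minimal sketch: the class name ChannelScale is hypothetical, and initializing the weight to ones is an assumption, not something stated in this thread.

```python
import torch
import torch.nn as nn


class ChannelScale(nn.Module):
    """Hypothetical per-channel scale: no normalization, no bias."""

    def __init__(self, num_channels):
        super().__init__()
        # Shape (1, C, 1, 1) so the weight broadcasts over (B, C, H, W).
        # Initializing to ones is an assumption made for this sketch.
        self.weight = nn.Parameter(torch.ones(1, num_channels, 1, 1))

    def forward(self, inp):
        # A single broadcasted elementwise multiplication.
        return inp * self.weight


x = torch.rand(2, 3, 50, 50)
layer = ChannelScale(3)
out = layer(x)
```

With the weight at its initial value of ones, the layer is an identity map, which makes it easy to drop in place of BN and check that nothing else changed.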
That is just a single multiplication, so I don’t expect any performance issue with it (no more than with any other operation that runs over the whole Tensor).
I would expect this operation to run faster than the entire BN operation. However, the measured run time shows that this multiplication is slower than the entire BN operation.
How do you check that? I don’t see that when running on CPU:
In : import torch
In : a = torch.rand(2, 3, 50, 50)
In : b = torch.nn.BatchNorm2d(3)
In : scaling = torch.rand(1, 3, 1, 1)
In : %timeit b(a)
115 µs ± 652 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In : %timeit a * scaling
5.15 µs ± 661 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Yes, this is what I find weird. I found it by training a ResNet50 network on ImageNet. I recorded the training time with BN and with only the affine layer. But the results are not consistent with the CPU comparison: models with BN run faster than models with only affine layers on 8 V100 GPUs. I tried both DataParallel and DDP modes.
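A forward-only CPU timing is not directly comparable to GPU training, which also runs a backward pass. CUDA kernels are also launched asynchronously, so a timer needs torch.cuda.synchronize() around it. A rough way to compare the two ops closer to the training setup might look like the sketch below. The helper name time_fwd_bwd, the tensor sizes, and the iteration counts are all assumptions chosen for illustration, and the script falls back to CPU when no GPU is available.

```python
import time
import torch


def time_fwd_bwd(fn, inp, iters=20, warmup=5):
    """Hypothetical helper: average seconds per forward+backward of fn(inp)."""
    def step():
        out = fn(inp)
        out.sum().backward()

    # Warm-up iterations are excluded from the measurement.
    for _ in range(warmup):
        step()
    if inp.is_cuda:
        torch.cuda.synchronize()  # wait for pending kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        step()
    if inp.is_cuda:
        torch.cuda.synchronize()  # wait for the timed kernels to finish
    return (time.perf_counter() - start) / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
# Sizes are arbitrary; they are not taken from the ResNet50 run above.
a = torch.rand(8, 64, 56, 56, device=device, requires_grad=True)
bn = torch.nn.BatchNorm2d(64).to(device)
scaling = torch.rand(1, 64, 1, 1, device=device, requires_grad=True)

t_bn = time_fwd_bwd(bn, a)
t_scale = time_fwd_bwd(lambda x: x * scaling, a)
print(f"BatchNorm2d: {t_bn * 1e6:.1f} us, scaling: {t_scale * 1e6:.1f} us")
```

Note that in the backward pass the gradient for the (1,C,1,1) weight is a reduction over the B, H and W dimensions, so the multiplication is not free during training even though it adds few FLOPs.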
Hello, I have found the reason why my network is slower. Sorry for this post!