Hi All,
I’m comparing two networks: a single large convolution versus a bottleneck block of 3 (Example A) or 2 (Example B) smaller convolutions. Profiling their feed-forward runtime in Python (with appropriate torch.cuda.synchronize() calls, via python -m torch.utils.bottleneck and nvprof), the runtimes are nowhere near what the flop counts predict. In fact, the stacked smaller convolutions are slower than the single large one. There seems to be a per-call CPU overhead in torch.conv2d that is not CPU-GPU communication, so each additional layer adds a fixed cost.
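For concreteness, the timing loop looks roughly like this (a minimal sketch; time_forward and the batch size of 1 are my choices here, the actual numbers below come from the profilers mentioned above):

```python
import time
import torch

def time_forward(layer, x, iters=10000):
    layer = layer.cuda().eval()
    x = x.cuda()
    with torch.no_grad():
        # warm-up so cuDNN autotuning / lazy init don't pollute the measurement
        for _ in range(100):
            layer(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            layer(x)
        torch.cuda.synchronize()  # drain queued kernels before stopping the clock
    return (time.time() - start) / iters * 1000.0  # ms per forward pass

# e.g. time_forward(baseline_layer, torch.randn(1, 64, 160, 120))
```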
Example A: flop count speedup = 10.
```python
# input size: (64, 160, 120)
baseline_layer = torch.nn.Conv2d(64, 64, kernel_size=(3, 3), bias=False)
distilled_layer = torch.nn.Sequential(
    torch.nn.Conv2d(64, 14, kernel_size=(1, 1), bias=False),
    torch.nn.Conv2d(14, 15, kernel_size=(3, 3), bias=False),
    torch.nn.Conv2d(15, 64, kernel_size=(1, 1), bias=False))
```
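The quoted speedup is just the ratio of per-output-pixel multiply-accumulates (the spatial dimensions roughly cancel between the two networks, since each has a single unpadded 3x3 conv):

```python
# per-output-pixel MACs: c_in * c_out * k * k, summed over layers
baseline_macs  = 64 * 64 * 3 * 3                      # 36864
distilled_macs = 64 * 14 + 14 * 15 * 3 * 3 + 15 * 64  # 3746
print(baseline_macs / distilled_macs)                 # ~9.84, i.e. the quoted ~10x
```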
Example B: flop count speedup = 85.3.
```python
# input size: (256, 160, 120)
baseline_layer2 = torch.nn.Conv2d(256, 128, kernel_size=(1, 1), bias=False)
distilled_layer2 = torch.nn.Sequential(
    torch.nn.Conv2d(256, 1, kernel_size=(1, 1), bias=False),
    torch.nn.Conv2d(1, 128, kernel_size=(1, 1), bias=False))
```
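Same arithmetic for Example B:

```python
# per-output-pixel MACs; spatial size cancels out of the ratio
baseline_macs2  = 256 * 128          # 32768
distilled_macs2 = 256 * 1 + 1 * 128  # 384
print(baseline_macs2 / distilled_macs2)  # 85.33, i.e. the quoted 85.3x
```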
Timing

Times are in ms (approximate, averaged over 10000 forward passes):

| Example       | GPU time (nvprof) | torch.conv2d time (cProfile) | Total elapsed time |
| ------------- | ----------------- | ---------------------------- | ------------------ |
| A - baseline  | 0.12              | 0.08                         | 0.20               |
| A - distilled | 0.15              | 0.16                         | 0.30               |
| B - baseline  | 0.27              | 0.05                         | 0.34               |
| B - distilled | 0.18              | 0.12                         | 0.29               |
This torch.conv2d overhead becomes even more pronounced in a larger network such as a ResNet, which contains many such bottleneck blocks.
I understand that the GPU time won’t track the op count when the image / convolution is too small to keep the highly optimized kernels busy. But is it possible to remove the torch.conv2d CPU overhead that comes with each additional layer? Would that require writing a custom C++ or CUDA extension?
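For example, would fusing the block with TorchScript be enough to cut the per-layer Python dispatch, or does the overhead live below that? A sketch of what I mean (I haven’t verified that this actually removes the overhead):

```python
import torch

distilled_layer = torch.nn.Sequential(
    torch.nn.Conv2d(64, 14, kernel_size=(1, 1), bias=False),
    torch.nn.Conv2d(14, 15, kernel_size=(3, 3), bias=False),
    torch.nn.Conv2d(15, 64, kernel_size=(1, 1), bias=False)).cuda().eval()

# torch.jit.script compiles the module once, so calls run in the TorchScript
# interpreter instead of dispatching through Python module-by-module.
scripted = torch.jit.script(distilled_layer)

x = torch.randn(1, 64, 160, 120, device="cuda")  # batch size 1 assumed
with torch.no_grad():
    out = scripted(x)
```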