Hello, I am building a model-parallel CNN on a CPU/GPU cluster. In both cases the backward pass is about 6x slower than the forward pass. On the CPU in particular, the autograd profiler mainly lists convolution operations as the bottleneck. The forward functions of the modules are somewhat complex and use a lot of if statements and indexing, similar to the following:
a = torch.ones(2, 3)
b = a[:, [0, 1, 2]]
Since the index list doesn't even permute the second dimension, I assumed such indexing has no overhead. From what I have read, the backward pass is normally only about 2x slower than the forward pass, which is why I thought of raising this question. Because of the model-parallel design, the weights of a single layer are divided across processes/nodes, so a single convolution layer is represented as a combination of several smaller convolutions that span nodes. Maybe this setup has an impact on performance. Below are the autograd profiles for the forward and backward passes on the CPU. Any help in this regard is highly appreciated. Thanks
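For context, here is a minimal single-process sketch of the channel-split convolution layout described above. The class name `SplitConv2d` and all details are hypothetical; in the real model the parts run on different nodes and their outputs are gathered via the 'Connect' ops:

```python
import torch
import torch.nn as nn

class SplitConv2d(nn.Module):
    """One logical conv layer whose output channels are split across
    several smaller convolutions, mimicking the model-parallel layout
    (run sequentially in one process here, for illustration only)."""

    def __init__(self, in_ch, out_ch, kernel_size, splits=2):
        super().__init__()
        chunk = out_ch // splits
        self.parts = nn.ModuleList(
            nn.Conv2d(in_ch, chunk, kernel_size, padding=kernel_size // 2)
            for _ in range(splits)
        )

    def forward(self, x):
        # In the real model each part runs on a different node and the
        # partial outputs are exchanged via the 'Connect' ops; here we
        # simply concatenate them along the channel dimension.
        return torch.cat([conv(x) for conv in self.parts], dim=1)

x = torch.randn(1, 3, 16, 16)
layer = SplitConv2d(3, 8, kernel_size=3, splits=2)
y = layer(x)
print(y.shape)  # torch.Size([1, 8, 16, 16])
```

The concatenation at the end is what shows up as the `_cat` entries in the profiles below.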
Profile Data
Using PyTorch v1.6 with MKL-DNN v1.2.0. The 'Connect' ops are the functions used for communication between nodes.
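For reference, the profiles below were collected along these lines (a sketch only; the stand-in model and input shapes here are not the real ones):

```python
import torch
import torch.autograd.profiler as profiler

model = torch.nn.Conv2d(3, 8, 3)  # stand-in for the real model-parallel CNN
x = torch.randn(1, 3, 32, 32, requires_grad=True)

# Profile the forward pass.
with profiler.profile() as fwd_prof:
    out = model(x)
fwd_table = fwd_prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=12)
print(fwd_table)

# Profile the backward pass separately.
with profiler.profile() as bwd_prof:
    out.sum().backward()
bwd_table = bwd_prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=12)
print(bwd_table)
```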
Forward Pass
--------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------
Name                        Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls
--------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------
mkldnn_convolution          36.62%            120.430ms       73.66%       242.227ms  5.767ms       42
C_Connect                   28.26%            92.930ms        31.67%       104.140ms  4.734ms       22
S_Connect                   18.63%            61.264ms        20.19%       66.377ms   8.297ms       8
native_batch_norm           4.63%             15.231ms        10.89%       35.819ms   895.465us     40
_cat                        3.58%             11.760ms        4.21%        13.851ms   364.492us     38
add                         1.08%             3.536ms         2.51%        8.261ms    56.585us      146
empty                       0.78%             2.562ms         0.79%        2.603ms    7.912us       329
mul                         0.57%             1.861ms         1.16%        3.822ms    238.906us     16
sub                         0.55%             1.796ms         1.12%        3.692ms    230.766us     16
op_Conv2D                   0.47%             1.532ms         37.70%       123.974ms  5.904ms       21
is_leaf                     0.41%             1.341ms         0.50%        1.648ms    0.659us       2500
leaky_relu                  0.39%             1.277ms         0.84%        2.759ms    153.299us     18
Backward Pass
-----------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------
Name                                 Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls
-----------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------
slow_conv_dilated2d                  20.68%            1.287s          69.93%       4.352s     37.949us      114688
slow_conv_transpose2d                13.49%            839.662ms       27.88%       1.735s     43.386ms      40
copy_                                11.12%            691.946ms       11.37%       707.678ms  2.842ms       249
mkldnn_convolution                   10.04%            624.944ms       20.11%       1.252s     41.719ms      30
size                                 6.53%             406.650ms       6.53%        406.650ms  0.243us       1674695
op_Conv2DBackward                    5.99%             372.616ms       95.47%       5.942s     282.958ms     21
slice                                4.58%             285.351ms       8.29%        515.909ms  2.246us       229714
_cat                                 4.14%             257.976ms       5.55%        345.569ms  12.342ms      28
_convolution                         2.87%             178.902ms       77.67%       4.834s     117.910ms     41
empty                                2.75%             171.139ms       2.78%        172.734ms  1.499us       115224
fill_                                2.39%             149.045ms       2.39%        149.047ms  2.592us       57500
as_strided                           2.36%             147.065ms       2.36%        147.065ms  0.631us       233126
select                               2.33%             144.808ms       4.15%        258.508ms  2.193us       117878
narrow                               1.96%             121.740ms       8.09%        503.789ms  4.384us       114916
C_ConnectBackward                    1.95%             121.317ms       2.45%        152.702ms  6.941ms       22
contiguous                           1.40%             86.923ms        1.56%        97.384ms   0.424us       229539