As the topic says, I have a network that takes about 2.67 s/it with a batch size of 30. For each sample I read one image and two NumPy binary arrays of shape (64, 900, 5) from file. Running the profiler gives me the following output:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
cudaMemcpyAsync 52.96% 15.116s 52.96% 15.116s 2.294ms 294.319ms 1.20% 294.319ms 44.661us 6590
aten::item 0.06% 18.401ms 50.56% 14.432s 1.361ms 0.000us 0.00% 497.951ms 46.968us 10602
aten::_local_scalar_dense 0.10% 29.038ms 50.53% 14.423s 1.360ms 199.000us 0.00% 498.054ms 46.977us 10602
autograd::engine::evaluate_function: ConvolutionBack... 0.05% 14.859ms 45.83% 13.080s 17.676ms 0.000us 0.00% 13.755s 18.588ms 740
cudaStreamSynchronize 25.57% 7.299s 25.57% 7.299s 1.119ms 217.831ms 0.88% 217.831ms 33.410us 6520
aten::to 0.00% 56.000us 25.52% 7.283s 364.149ms 0.000us 0.00% 1.449ms 72.450us 20
aten::_to_copy 0.00% 203.000us 25.52% 7.283s 364.146ms 0.000us 0.00% 1.449ms 72.450us 20
aten::copy_ 0.00% 276.000us 25.52% 7.283s 364.126ms 0.000us 0.00% 1.449ms 72.450us 20
aten::conv2d 0.00% 62.000us 6.96% 1.988s 198.760ms 0.000us 0.00% 86.250ms 8.625ms 10
aten::convolution 0.00% 90.000us 6.96% 1.988s 198.754ms 0.000us 0.00% 86.250ms 8.625ms 10
aten::_convolution 0.00% 163.000us 6.96% 1.987s 198.745ms 0.000us 0.00% 86.250ms 8.625ms 10
aten::cudnn_convolution 6.87% 1.961s 6.96% 1.987s 198.663ms 29.677ms 0.12% 79.445ms 7.944ms 10
cudaFree 3.53% 1.009s 3.53% 1.009s 25.214ms 1.538ms 0.01% 1.538ms 38.450us 40
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 28.542s
Self CUDA time total: 24.628s
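For reference, this is roughly how I collected the profile (a sketch with a stand-in model, not my real network; the `profile`/`ProfilerActivity` API is from `torch.profiler`):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Always profile the CPU; only add CUDA when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

model = torch.nn.Conv2d(3, 8, kernel_size=3)  # stand-in for my real network
x = torch.randn(30, 3, 64, 64)                # batch size 30, like my run

with profile(activities=activities) as prof:
    y = model(x)
    loss = y.sum()
    loss.backward()

# Same kind of table as above, sorted by self CPU time.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```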
It looks like most of my time is being taken by cudaMemcpyAsync. Is there anything I should look for that I might be doing inefficiently? I have already used np.memmap to load the data lazily. At this point I'm really out of ideas for speeding up my processing. I'd really appreciate any help.
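In case it matters, this is roughly how I read the bin arrays (a minimal, NumPy-only sketch; the file name is made up, and my real code wraps this in a Dataset's __getitem__ before handing the array to torch):

```python
import os
import tempfile
import numpy as np

# Write a dummy .bin with the same shape/dtype as my real files
# (shape (64, 900, 5) and float32 match my setup).
shape = (64, 900, 5)
path = os.path.join(tempfile.mkdtemp(), "frame_000.bin")
np.zeros(shape, dtype=np.float32).tofile(path)

# np.memmap maps the file instead of reading it all into RAM;
# only the pages touched by indexing are actually loaded.
arr = np.memmap(path, dtype=np.float32, mode="r", shape=shape)

# Copy one slab out of the mapping, as my loader does per sample.
slab = np.array(arr[0])
print(slab.shape)  # (900, 5)
```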
Thanks