Analyze PyTorch profile and help speed up the network

As the topic says, I have a network that takes about 2.67 s/it with a batch size of 30. For each sample I read one image and two NumPy binary arrays of shape (64, 900, 5) from file. Running the profiler gives me the following output:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls                                                                      
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        cudaMemcpyAsync        52.96%       15.116s        52.96%       15.116s       2.294ms     294.319ms         1.20%     294.319ms      44.661us          6590                                                                                
                                             aten::item         0.06%      18.401ms        50.56%       14.432s       1.361ms       0.000us         0.00%     497.951ms      46.968us         10602                                                                            
                              aten::_local_scalar_dense         0.10%      29.038ms        50.53%       14.423s       1.360ms     199.000us         0.00%     498.054ms      46.977us         10602                                                                              
autograd::engine::evaluate_function: ConvolutionBack...         0.05%      14.859ms        45.83%       13.080s      17.676ms       0.000us         0.00%       13.755s      18.588ms           740                                                                                
                                  cudaStreamSynchronize        25.57%        7.299s        25.57%        7.299s       1.119ms     217.831ms         0.88%     217.831ms      33.410us          6520                                                                         
                                               aten::to         0.00%      56.000us        25.52%        7.283s     364.149ms       0.000us         0.00%       1.449ms      72.450us            20                                          
                                         aten::_to_copy         0.00%     203.000us        25.52%        7.283s     364.146ms       0.000us         0.00%       1.449ms      72.450us            20                                              
                                            aten::copy_         0.00%     276.000us        25.52%        7.283s     364.126ms       0.000us         0.00%       1.449ms      72.450us            20                                                      
                                           aten::conv2d         0.00%      62.000us         6.96%        1.988s     198.760ms       0.000us         0.00%      86.250ms       8.625ms            10                            
                                      aten::convolution         0.00%      90.000us         6.96%        1.988s     198.754ms       0.000us         0.00%      86.250ms       8.625ms            10                    
                                     aten::_convolution         0.00%     163.000us         6.96%        1.987s     198.745ms       0.000us         0.00%      86.250ms       8.625ms            10    
                                aten::cudnn_convolution         6.87%        1.961s         6.96%        1.987s     198.663ms      29.677ms         0.12%      79.445ms       7.944ms            10                    
                                               cudaFree         3.53%        1.009s         3.53%        1.009s      25.214ms       1.538ms         0.01%       1.538ms      38.450us            40                                                                                
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 28.542s
Self CUDA time total: 24.628s

It looks like most of my time is being taken by cudaMemcpyAsync. Is there anything I might be doing inefficiently that I should look for? I have already used np.memmap to load the data lazily. At this point I'm really out of ideas for speeding up my processing. I'd really appreciate any help.
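For reference, the loading is roughly along these lines (a simplified sketch; the file paths, dtype, and image handling below are placeholders, not my exact code):

```python
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image


class LazyBinDataset(Dataset):
    """Simplified: one image plus two memory-mapped (64, 900, 5) arrays per sample."""

    def __init__(self, image_paths, bin_paths_a, bin_paths_b):
        self.image_paths = image_paths
        self.bin_paths_a = bin_paths_a
        self.bin_paths_b = bin_paths_b

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # np.memmap maps the file lazily; pages are only read on first access.
        arr_a = np.memmap(self.bin_paths_a[idx], dtype=np.float32, mode="r",
                          shape=(64, 900, 5))
        arr_b = np.memmap(self.bin_paths_b[idx], dtype=np.float32, mode="r",
                          shape=(64, 900, 5))
        img = np.asarray(Image.open(self.image_paths[idx]).convert("RGB"),
                         dtype=np.float32)
        # Copy out of the memmaps so the returned tensors own their memory.
        return (torch.from_numpy(img),
                torch.from_numpy(np.array(arr_a)),
                torch.from_numpy(np.array(arr_b)))
```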
Thanks

You are synchronizing the code via the .item() calls, so try to remove them as described in the performance guide.
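Every .item() call copies a scalar back to the host and blocks the CPU until the GPU has caught up, which lines up with the ~10k aten::item / aten::_local_scalar_dense calls and the large cudaMemcpyAsync / cudaStreamSynchronize times in your profile. A minimal sketch of the usual fix, accumulating the loss as a CUDA tensor and reading it back only once per epoch (the toy model, loss, and data below are placeholders, and a CUDA device is assumed):

```python
import torch
import torch.nn as nn

# Toy stand-ins: the real model, criterion, and DataLoader come from the training script.
model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.randn(30, 10), torch.randn(30, 1)) for _ in range(5)]

# Keep the running loss on the GPU instead of calling .item() every iteration.
running_loss = torch.zeros((), device="cuda")
for x, y in loader:
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    # running_loss += loss.item()   # forces a GPU->CPU copy and a stream sync every step
    running_loss += loss.detach()   # stays on the GPU, no synchronization

# A single synchronization per epoch instead of one (or more) per iteration.
print(f"epoch loss: {running_loss.item() / len(loader):.4f}")
```

If some synchronizations remain after removing the .item() calls, torch.cuda.set_sync_debug_mode("warn") (available in recent PyTorch versions) will emit a warning whenever an operation forces an implicit sync, which makes the remaining call sites easy to find.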