Analyze PyTorch profile and help speed up the network

As the topic says, I have a network that takes about 2.67 s/it with a batch size of 30. For each sample I read one image and two NumPy binary arrays of shape (64, 900, 5) from file. Running the profiler gives me the following output:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls                                                                      
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        cudaMemcpyAsync        52.96%       15.116s        52.96%       15.116s       2.294ms     294.319ms         1.20%     294.319ms      44.661us          6590                                                                                
                                             aten::item         0.06%      18.401ms        50.56%       14.432s       1.361ms       0.000us         0.00%     497.951ms      46.968us         10602                                                                            
                              aten::_local_scalar_dense         0.10%      29.038ms        50.53%       14.423s       1.360ms     199.000us         0.00%     498.054ms      46.977us         10602                                                                              
autograd::engine::evaluate_function: ConvolutionBack...         0.05%      14.859ms        45.83%       13.080s      17.676ms       0.000us         0.00%       13.755s      18.588ms           740                                                                                
                                  cudaStreamSynchronize        25.57%        7.299s        25.57%        7.299s       1.119ms     217.831ms         0.88%     217.831ms      33.410us          6520                                                                         
                                               aten::to         0.00%      56.000us        25.52%        7.283s     364.149ms       0.000us         0.00%       1.449ms      72.450us            20                                          
                                         aten::_to_copy         0.00%     203.000us        25.52%        7.283s     364.146ms       0.000us         0.00%       1.449ms      72.450us            20                                              
                                            aten::copy_         0.00%     276.000us        25.52%        7.283s     364.126ms       0.000us         0.00%       1.449ms      72.450us            20                                                      
                                           aten::conv2d         0.00%      62.000us         6.96%        1.988s     198.760ms       0.000us         0.00%      86.250ms       8.625ms            10                            
                                      aten::convolution         0.00%      90.000us         6.96%        1.988s     198.754ms       0.000us         0.00%      86.250ms       8.625ms            10                    
                                     aten::_convolution         0.00%     163.000us         6.96%        1.987s     198.745ms       0.000us         0.00%      86.250ms       8.625ms            10    
                                aten::cudnn_convolution         6.87%        1.961s         6.96%        1.987s     198.663ms      29.677ms         0.12%      79.445ms       7.944ms            10                    
                                               cudaFree         3.53%        1.009s         3.53%        1.009s      25.214ms       1.538ms         0.01%       1.538ms      38.450us            40                                                                                
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 28.542s
Self CUDA time total: 24.628s

It looks like most of my time is being taken by cudaMemcpyAsync. Is there anything I might be doing inefficiently that I should look for? I have already used np.memmap to load the data lazily. At this point I'm really out of ideas for speeding up my processing. I'd really appreciate any help.
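For reference, the loading is roughly along these lines (a simplified sketch; the file paths, dtype, and image handling below are placeholders, not my exact code):

```python
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image


class LazyBinDataset(Dataset):
    """Simplified: one image plus two memory-mapped (64, 900, 5) arrays per sample."""

    def __init__(self, image_paths, bin_paths_a, bin_paths_b):
        self.image_paths = image_paths
        self.bin_paths_a = bin_paths_a
        self.bin_paths_b = bin_paths_b

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # np.memmap maps the file lazily; pages are only read on first access.
        arr_a = np.memmap(self.bin_paths_a[idx], dtype=np.float32, mode="r",
                          shape=(64, 900, 5))
        arr_b = np.memmap(self.bin_paths_b[idx], dtype=np.float32, mode="r",
                          shape=(64, 900, 5))
        img = np.asarray(Image.open(self.image_paths[idx]).convert("RGB"),
                         dtype=np.float32)
        # Copy out of the memmaps so the returned tensors own their memory.
        return (torch.from_numpy(img),
                torch.from_numpy(np.array(arr_a)),
                torch.from_numpy(np.array(arr_b)))
```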
Thanks

You are synchronizing the code via the .item() calls, so try to remove them as described in the performance guide.
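Every .item() call copies a scalar back to the host and blocks the CPU until the GPU has caught up, which lines up with the ~10k aten::item / aten::_local_scalar_dense calls and the large cudaMemcpyAsync / cudaStreamSynchronize times in your profile. A minimal sketch of the usual fix, accumulating the loss as a CUDA tensor and reading it back only once per epoch (the toy model, loss, and data below are placeholders, and a CUDA device is assumed):

```python
import torch
import torch.nn as nn

# Toy stand-ins: the real model, criterion, and DataLoader come from the training script.
model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.randn(30, 10), torch.randn(30, 1)) for _ in range(5)]

# Keep the running loss on the GPU instead of calling .item() every iteration.
running_loss = torch.zeros((), device="cuda")
for x, y in loader:
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    # running_loss += loss.item()   # forces a GPU->CPU copy and a stream sync every step
    running_loss += loss.detach()   # stays on the GPU, no synchronization

# A single synchronization per epoch instead of one (or more) per iteration.
print(f"epoch loss: {running_loss.item() / len(loader):.4f}")
```

If some synchronizations remain after removing the .item() calls, torch.cuda.set_sync_debug_mode("warn") (available in recent PyTorch versions) will emit a warning whenever an operation forces an implicit sync, which makes the remaining call sites easy to find.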