Low GPU utilization with very fast dataloading

Hello everyone, I can’t find the bottleneck in my training program.
As you can see, GPU utilization is low at 43% (I am using a batch size of 40, which consumes 12 GB of the 16 GB available).
The bottleneck isn’t the dataloading, as you can see. It seems to come from the forward and backward passes…


When I check the operators’ execution times, for some reason they consume CPU time. I am only using native PyTorch tensors in my forward and loss computation.
https://imgur.com/a/VdVqTTY

I profiled the forward pass separately, and it shows a CPU total of 2 s, but the individual operations’ CPU times don’t sum to 100% of that total. I am confused.
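For context, the table below was produced with the autograd profiler along these lines (a simplified sketch; the Linear model and input shapes are placeholders standing in for my actual Transformer and batch):

```python
import torch
from torch.autograd import profiler

# Placeholder model/input; my real setup is a Transformer with batch size 40.
model = torch.nn.Linear(512, 512).cuda()
x = torch.randn(40, 512, device="cuda")

with profiler.profile(use_cuda=True) as prof:
    with profiler.record_function("model_forward"):
        out = model(x)

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
```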

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          model_forward         0.40%       8.407ms       100.00%        2.106s        2.106s       0.000us         0.00%      51.656ms      51.656ms             1
                                              aten::bmm         0.14%       3.033ms         0.30%       6.262ms     313.100us      23.576ms        45.64%      23.576ms       1.179ms            20
                                maxwell_sgemm_128x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us      18.129ms        35.10%      18.129ms       2.014ms             9
                                             aten::add_         0.52%      10.907ms         0.55%      11.569ms     236.102us      11.983ms        23.20%      11.983ms     244.551us            49
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      11.282ms        21.84%      11.282ms     245.261us            46
                                          aten::softmax         0.00%      67.000us         0.34%       7.056ms     705.600us       0.000us         0.00%       8.552ms     855.200us            10
                                         aten::_softmax         0.08%       1.584ms         0.33%       6.989ms     698.900us       8.552ms        16.56%       8.552ms     855.200us            10
void (anonymous namespace)::softmax_warp_forward<flo...         0.00%       0.000us         0.00%       0.000us       0.000us       7.342ms        14.21%       7.342ms       1.049ms             7
                                maxwell_sgemm_128x64_tn         0.00%       0.000us         0.00%       0.000us       0.000us       5.623ms        10.89%       5.623ms     401.643us            14
                                           aten::linear         0.02%     344.000us         2.03%      42.657ms       1.066ms       0.000us         0.00%       2.733ms      68.325us            40
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.106s
Self CUDA time total: 51.656ms

Thank you for your help.

Is this model using the C++ API rather than Python? It might be useful to share some of the training loop code to check if the profile is missing something.


Hello Eqy, thanks for your help. It’s using only Python. I am using the Transformer model (the built-in PyTorch module).

I timed the training loop phases myself (following the ResNet example):

Train Epoch: 1 [7341/29371 (24.9940 %)]         Loss: 4.6176743507385254        GPU Time: 0:08:22.217366        Dataloader Time: 0:01:01.206019    Others: 0:00:11.105754
Train Epoch: 1 [14683/29371 (49.9915 %)]        Loss: 3.3774175643920898        GPU Time: 0:08:11.256317        Dataloader Time: 0:00:43.877545    Others: 0:00:09.344856
Train Epoch: 1 [22025/29371 (74.9889 %)]        Loss: 3.6628768444061279        GPU Time: 0:08:09.219690        Dataloader Time: 0:00:41.986479    Others: 0:00:08.147579
Train Epoch: 1 [29367/29371 (99.9864 %)]        Loss: 2.7599744796752930        GPU Time: 0:08:24.159365        Dataloader Time: 0:00:40.122311    Others: 0:00:09.004757

These are the cumulative times over each quarter of the dataset.
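The per-phase timing is accumulated roughly like this (a simplified sketch; the model, data, and optimizer are placeholders for my actual setup):

```python
import datetime
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholders standing in for my Transformer, loss, and dataset.
model = torch.nn.Linear(512, 512).cuda()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
train_loader = DataLoader(
    TensorDataset(torch.randn(1000, 512), torch.randn(1000, 512)),
    batch_size=40,
)

gpu_time = datetime.timedelta()
data_time = datetime.timedelta()

end = datetime.datetime.now()
for batch, target in train_loader:
    data_time += datetime.datetime.now() - end  # time blocked on the loader

    start = datetime.datetime.now()
    batch, target = batch.cuda(), target.cuda()
    loss = criterion(model(batch), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Without torch.cuda.synchronize() this mostly measures where the CPU
    # happens to block, since CUDA kernels are launched asynchronously.
    gpu_time += datetime.datetime.now() - start

    end = datetime.datetime.now()
```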

It can be a bit difficult to vet the timing code without a snippet (e.g., are there torch.cuda.synchronize calls in the appropriate places?). The dataloader time also still looks significant; it should be effectively zero, since loading can run independently of, and ahead of, the current training iteration.
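For example, here is a minimal sketch of timing a forward pass faithfully, either with explicit synchronization or with CUDA events (the model and input are just stand-ins):

```python
import time
import torch

model = torch.nn.Linear(512, 512).cuda()  # stand-in for the real model
x = torch.randn(40, 512, device="cuda")

# Option 1: wall-clock timing bracketed by explicit synchronization.
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model(x)
torch.cuda.synchronize()  # wait for all queued kernels to finish
print(f"forward: {(time.perf_counter() - t0) * 1e3:.3f} ms")

# Option 2: CUDA events measure on-device time directly.
start = torch.cuda.Event(enable_timing=True)
stop = torch.cuda.Event(enable_timing=True)
start.record()
out = model(x)
stop.record()
torch.cuda.synchronize()
print(f"forward: {start.elapsed_time(stop):.3f} ms")  # milliseconds
```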

I don’t use torch.cuda.synchronize. I will try to lower the dataloader time to zero! Thanks for the help.
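As a first step, I will try something like the following DataLoader configuration (the worker count is a guess to tune; the dataset is a stand-in for mine):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(1000, 512))  # stand-in for my dataset

train_loader = DataLoader(
    train_dataset,
    batch_size=40,
    shuffle=True,
    num_workers=4,            # load batches in background processes
    pin_memory=True,          # enables faster, async host-to-device copies
    prefetch_factor=2,        # batches prefetched per worker
    persistent_workers=True,  # keep workers alive between epochs
)
```

With pin_memory=True, the host-to-device copies can also be made non-blocking (e.g., batch.cuda(non_blocking=True)) so they overlap with compute.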