Cannot get better performance after changing NVIDIA GPU

Hello guys!

For research purposes we rented a new workstation (workstation 1) with three NVIDIA RTX A5000 GPUs (Ubuntu 18.04 LTS, CUDA 11.1, and Spyder) because we need to train a much bigger model on 3D training samples. To test performance, I reduced the size of each training sample to 50 by 50 by 50 and compared the rented workstation against our old one (workstation 2), which has three NVIDIA RTX 2080 Ti GPUs (Ubuntu 16.04 LTS, CUDA 10.2, and Spyder), using the same Python script. The results show that each training epoch took 0.21 minutes on workstation 2 but 1.38 minutes on workstation 1. I want to find the cause, because there is no reason an A5000 should run slower than an RTX 2080 Ti. The measurements are listed below:

Workstation 1:
GPU: three NVIDIA RTX A5000 GPUs
Parallelization technique:
import torch
import torch.nn as nn
import torch.optim as optim

# net and trainset are defined earlier in the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if str(device) == "cuda:0":
    print('Use Cuda GPU!')

# Wrap the model so each batch is split across all visible GPUs
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    net = nn.DataParallel(net)
optimizer = optim.Adam(net.parameters(), lr=0.001)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=20, shuffle=False, num_workers=0)
Let's name the three GPUs GPU 0, 1, and 2.
Training time for each epoch:
1. GPU 0 only, without DataParallel: 1.38 min (Ubuntu OS unaffected)
2. GPU 1 only, with DataParallel: 1.41 min (OS becomes slow; moving another window renders frame by frame)
3. GPU 1 and GPU 2 working in parallel: 6 min (OS becomes slow; moving another window renders frame by frame)
4. GPU 0, 1, 2 working in parallel: 3.8 min (OS becomes slow; moving another window renders frame by frame)


Workstation 2:
GPU: three NVIDIA RTX 2080 Ti GPUs
Parallelization technique:
import torch
import torch.nn as nn
import torch.optim as optim

# net and trainset are defined earlier in the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if str(device) == "cuda:0":
    print('Use Cuda GPU!')

# Wrap the model so each batch is split across all visible GPUs
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    net = nn.DataParallel(net)
optimizer = optim.Adam(net.parameters(), lr=0.001)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=20, shuffle=False, num_workers=0)
Let's name the three GPUs GPU 0, 1, and 2.
Training time for each epoch:
1. GPU 0, 1, 2 working in parallel: 0.21 min (OS becomes slow, but not as severely as on workstation 1)
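
To put the gap in numbers, here is a small sketch that divides each workstation 1 epoch time by the 0.21 min measured on workstation 2. The times are copied from the two lists above; the names `BASELINE_MIN`, `workstation_1_min`, and `slowdown` are only illustrative.

```python
# Epoch times in minutes, copied from the lists above.
BASELINE_MIN = 0.21  # workstation 2, GPU 0, 1, 2 with DataParallel

workstation_1_min = {
    "GPU 0 only, no DataParallel": 1.38,
    "GPU 1 only, DataParallel": 1.41,
    "GPU 1 and 2, DataParallel": 6.0,
    "GPU 0, 1, 2, DataParallel": 3.8,
}


def slowdown(epoch_min, baseline_min=BASELINE_MIN):
    """Return how many times longer an epoch takes than the baseline."""
    return epoch_min / baseline_min


for name, minutes in workstation_1_min.items():
    print(f"{name}: {slowdown(minutes):.1f}x slower than workstation 2")
```

Even the best case on workstation 1 (a single GPU) is more than six times slower, and adding GPUs makes it worse rather than better.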

Could you please suggest any solutions to improve performance on the A5000 GPUs? Thanks a lot!

@tjk I would recommend using a profiler to find out which operation is the bottleneck.
The profiler that I use is
https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
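
As a rough idea of what that looks like, here is a minimal sketch that wraps one training epoch in `torch.profiler`. The helper names `profile_epoch` and `demo`, the tiny stand-in model, and the random data are assumptions for illustration only, not your script; the `model_inference` label is just the one the recipe uses. PyTorch is imported inside the functions only so the snippet still loads on a machine without it.

```python
def profile_epoch(net, loader, criterion, optimizer, device, row_limit=20):
    """Run one training epoch under torch.profiler and return the summary table."""
    # Imported here so the snippet still loads where PyTorch is missing.
    from torch.profiler import ProfilerActivity, profile, record_function

    activities = [ProfilerActivity.CPU]
    sort_key = "cpu_time_total"
    if device.type == "cuda":
        # Also record GPU kernels and rank the table by GPU time.
        activities.append(ProfilerActivity.CUDA)
        sort_key = "cuda_time_total"

    net.train()
    with profile(activities=activities) as prof:
        with record_function("model_inference"):
            for inputs, targets in loader:
                inputs, targets = inputs.to(device), targets.to(device)
                optimizer.zero_grad()
                loss = criterion(net(inputs), targets)
                loss.backward()
                optimizer.step()
    return prof.key_averages().table(sort_by=sort_key, row_limit=row_limit)


def demo():
    """Profile a tiny stand-in 3D model on random data (illustration only)."""
    import torch
    import torch.nn as nn
    import torch.optim as optim

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    net = nn.Sequential(
        nn.Conv3d(1, 2, kernel_size=3),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(2 * 6 * 6 * 6, 1),
    ).to(device)
    # Two random batches of four 8x8x8 single-channel volumes.
    loader = [(torch.randn(4, 1, 8, 8, 8), torch.randn(4, 1)) for _ in range(2)]
    optimizer = optim.Adam(net.parameters(), lr=0.001)
    return profile_epoch(net, loader, nn.MSELoss(), optimizer, device)
```

Calling `print(demo())` prints one table per run; in your case you would pass your own `net`, `trainloader`, loss, and `optimizer` to `profile_epoch` once per epoch and compare the top rows between the two workstations.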


@anantguptadbl Many thanks. I have generated the PyTorch profiler logs following your instructions. It seems that cuDNN takes the most time. The logs are listed below:

Workstation 2, no problems, NVIDIA RTX 2080Ti

Workstation 2, 1 GPU:
Epoch: 0

0.00225191914183454
0.0004021667661433722
0.00015078190647882007
0.00042524349663673137
0.00025625204780785804
0.00013806319813493
0.00017392947503721304
0.00023006257905818802
0.503632 minutes passed!

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::mm         0.00%       1.250ms         7.77%        2.852s      71.302ms       11.552s        48.44%       11.552s     288.803ms            40  
                                        model_inference        27.51%       10.099s        76.48%       28.072s       28.072s     486.375ms         2.04%        9.208s        9.208s             1  
                                   DataParallel.forward         0.01%       5.039ms         7.83%        2.872s     359.048ms       0.000us         0.00%        8.507s        1.063s             8  
                                             MmBackward         0.00%     259.000us         0.01%       1.930ms     120.625us       0.000us         0.00%        7.302s     456.400ms            16  
                               CudnnConvolutionBackward         0.00%     651.000us        23.46%        8.613s     215.323ms       0.000us         0.00%        7.286s     182.143ms            40  
                       aten::cudnn_convolution_backward         0.00%       1.056ms        23.46%        8.612s     215.307ms       0.000us         0.00%        7.286s     182.143ms            40  
                aten::cudnn_convolution_backward_weight         0.01%       2.466ms         0.01%       5.480ms     137.000us        5.375s        22.54%        5.375s     134.373ms            40  
void cudnn::detail::convolve_wgradNd_engine<3, 256, ...         0.00%       0.000us         0.00%       0.000us       0.000us        5.287s        22.17%        5.287s     220.288ms            24  
                                   volta_dgemm_64x64_nt         0.00%       0.000us         0.00%       0.000us       0.000us        4.294s        18.01%        4.294s     268.395ms            16  
                                           aten::linear         0.00%     623.000us         7.77%        2.853s     178.325ms       0.000us         0.00%        4.264s     266.498ms            16  
                                           aten::matmul         0.00%     328.000us         7.77%        2.851s     178.211ms       0.000us         0.00%        4.250s     265.608ms            16  
                                      aten::convolution         0.00%     326.000us         0.02%       7.692ms     192.300us       0.000us         0.00%        4.219s     105.475ms            40  
                                     aten::_convolution         0.00%     743.000us         0.02%       7.366ms     184.150us       0.000us         0.00%        4.219s     105.475ms            40  
                                aten::cudnn_convolution         0.01%       2.312ms         0.01%       4.947ms     123.675us        4.210s        17.65%        4.210s     105.242ms            40  
                                           aten::conv3d         0.00%     153.000us         0.01%       4.405ms     183.542us       0.000us         0.00%        4.157s     173.195ms            24  
void cudnn::detail::implicit_convolveNd_dgemm<3, 256...         0.00%       0.000us         0.00%       0.000us       0.000us        4.149s        17.40%        4.149s     172.865ms            24  
                                  volta_dgemm_128x64_tn         0.00%       0.000us         0.00%       0.000us       0.000us        4.038s        16.93%        4.038s     269.214ms            15  
                                  volta_dgemm_128x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us        3.008s        12.61%        3.008s     376.010ms             8  
                 aten::cudnn_convolution_backward_input         0.17%      60.883ms        23.44%        8.606s     215.143ms        1.911s         8.01%        1.911s      47.771ms            40  
                                   volta_zgemm_32x32_tn         0.00%       0.000us         0.00%       0.000us       0.000us     761.130ms         3.19%     761.130ms     380.565us          2000  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 36.707s
Self CUDA time total: 23.847s

Epoch: 1

0.0002444413932154234
0.00023899785216162953
0.00020258519901078276
0.00016205136832252151
0.00013764596967574087
0.00013781612243572125
0.00017208402650515018
0.00018750306695110606
0.442265 minutes passed!

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::mm         0.00%       1.031ms         0.01%       1.959ms      48.975us       11.748s        49.46%       11.748s     293.694ms            40  
                                        model_inference        27.72%        9.095s        72.97%       23.941s       23.941s      51.884ms         0.22%        9.002s        9.002s             1  
                                   DataParallel.forward         0.01%       4.383ms         0.06%      18.424ms       2.303ms       0.000us         0.00%        8.735s        1.092s             8  
                               CudnnConvolutionBackward         0.00%     694.000us        26.97%        8.850s     221.252ms       0.000us         0.00%        7.354s     183.838ms            40  
                       aten::cudnn_convolution_backward         0.00%       1.157ms        26.97%        8.849s     221.235ms       0.000us         0.00%        7.354s     183.838ms            40  
                                             MmBackward         0.00%     257.000us         0.00%       1.528ms      95.500us       0.000us         0.00%        7.342s     458.885ms            16  
                aten::cudnn_convolution_backward_weight         0.01%       1.801ms         0.02%       5.078ms     126.950us        5.432s        22.87%        5.432s     135.808ms            40  
void cudnn::detail::convolve_wgradNd_engine<3, 256, ...         0.00%       0.000us         0.00%       0.000us       0.000us        5.344s        22.50%        5.344s     222.679ms            24  
                                           aten::linear         0.00%     224.000us         0.01%       2.705ms     169.062us       0.000us         0.00%        4.421s     276.290ms            16  
                                           aten::matmul         0.00%     249.000us         0.00%       1.548ms      96.750us       0.000us         0.00%        4.406s     275.351ms            16  
                                   volta_dgemm_64x64_nt         0.00%       0.000us         0.00%       0.000us       0.000us        4.321s        18.19%        4.321s     270.063ms            16  
                                      aten::convolution         0.00%     350.000us         0.02%       7.188ms     179.700us       0.000us         0.00%        4.290s     107.250ms            40  
                                     aten::_convolution         0.00%     764.000us         0.02%       6.838ms     170.950us       0.000us         0.00%        4.290s     107.250ms            40  
                                aten::cudnn_convolution         0.01%       3.003ms         0.01%       4.703ms     117.575us        4.280s        18.02%        4.280s     106.999ms            40  
                                           aten::conv3d         0.00%     165.000us         0.01%       3.610ms     150.417us       0.000us         0.00%        4.228s     176.160ms            24  
void cudnn::detail::implicit_convolveNd_dgemm<3, 256...         0.00%       0.000us         0.00%       0.000us       0.000us        4.219s        17.76%        4.219s     175.790ms            24  
                                  volta_dgemm_128x64_tn         0.00%       0.000us         0.00%       0.000us       0.000us        4.196s        17.66%        4.196s     279.705ms            15  
                                  volta_dgemm_128x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us        3.021s        12.72%        3.021s     377.643ms             8  
                 aten::cudnn_convolution_backward_input         0.18%      58.439ms        26.95%        8.843s     221.079ms        1.921s         8.09%        1.921s      48.030ms            40  
                                   volta_zgemm_32x32_tn         0.00%       0.000us         0.00%       0.000us       0.000us     755.710ms         3.18%     755.710ms     377.855us          2000  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 32.809s
Self CUDA time total: 23.751s

Workstation 2, 2 GPUs:
Use Cuda GPU!
Let's use 2 GPUs!
Epoch: 0

0.0010225729964364453
0.00027090615543547516
0.00019945351665674564
0.0001817381311407199
0.00016949458102390858
0.00015111339930719533
0.0001469659275806815
0.00014119658428216468
0.368648 minutes passed!

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               CudnnConvolutionBackward         0.01%       1.411ms        28.89%        7.683s      96.035ms       0.000us         0.00%       10.228s     127.855ms            80  
                       aten::cudnn_convolution_backward         0.01%       2.524ms        28.89%        7.681s      96.017ms       0.000us         0.00%       10.228s     127.855ms            80  
                aten::cudnn_convolution_backward_weight         0.02%       4.278ms         0.04%      10.996ms     137.450us        7.095s        25.96%        7.095s      88.682ms            80  
void cudnn::detail::convolve_wgradNd_engine<3, 256, ...         0.00%       0.000us         0.00%       0.000us       0.000us        7.002s        25.62%        7.002s     145.869ms            48  
                                             MmBackward         0.00%     636.000us         0.02%       3.991ms     124.719us       0.000us         0.00%        6.988s     218.368ms            32  
                                               aten::mm         0.00%     848.000us         0.01%       2.388ms      49.750us        6.988s        25.57%        6.988s     145.578ms            48  
                                   volta_dgemm_64x64_nt         0.00%       0.000us         0.00%       0.000us       0.000us        4.090s        14.97%        4.090s     127.813ms            32  
void cudnn::detail::implicit_convolveNd_dgemm<3, 256...         0.00%       0.000us         0.00%       0.000us       0.000us        3.959s        14.49%        3.959s      82.474ms            48  
                 aten::cudnn_convolution_backward_input         0.47%     123.661ms        28.84%        7.668s      95.848ms        3.134s        11.47%        3.134s      39.173ms            80  
                                   volta_dgemm_64x64_tn         0.00%       0.000us         0.00%       0.000us       0.000us        2.963s        10.84%        2.963s     164.631ms            18  
                                  volta_dgemm_128x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us        2.709s         9.91%        2.709s     193.478ms            14  
                                   volta_zgemm_32x32_tn         0.00%       0.000us         0.00%       0.000us       0.000us        1.286s         4.71%        1.286s     367.442us          3500  
                                  volta_dgemm_128x64_tn         0.00%       0.000us         0.00%       0.000us       0.000us        1.175s         4.30%        1.175s      83.954ms            14  
                                        model_inference        20.29%        5.394s        69.30%       18.426s       18.426s     477.363ms         1.75%     921.022ms     921.022ms             1  
                                      BroadcastBackward         0.00%     414.000us         0.07%      17.597ms       2.200ms       0.000us         0.00%     919.384ms     114.923ms             8  
                                     ReduceAddCoalesced         0.02%       5.676ms         0.06%      17.183ms       2.148ms     919.314ms         3.36%     919.384ms     114.923ms             8  
               ncclReduceRingLLKernel_sum_f64(ncclColl)         0.00%       0.000us         0.00%       0.000us       0.000us     919.314ms         3.36%     919.314ms      14.364ms            64  
void cudnn::detail::dgrad2d_alg1_1<double, 0, 7, 5, ...         0.00%       0.000us         0.00%       0.000us       0.000us     559.521ms         2.05%     559.521ms      34.970ms            16  
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us     477.380ms         1.75%     477.380ms       3.510ms           136  
void convolveNd_dgrad_engine<3, 256, false, true, 6,...         0.00%       0.000us         0.00%       0.000us       0.000us     325.446ms         1.19%     325.446ms      54.241ms             6  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 26.589s
Self CUDA time total: 27.325s

Epoch: 1

0.00012511684844552457
0.00012616094675988522
0.00012037825835751655
0.00011708543681867549
0.00011416520242080682
0.00010525721590552889
0.00011137502517889492
0.00011518851261321432
0.318012 minutes passed!

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               CudnnConvolutionBackward         0.01%       1.475ms        37.49%        8.834s     110.428ms       0.000us         0.00%       10.470s     130.877ms            80  
                       aten::cudnn_convolution_backward         0.01%       2.533ms        37.49%        8.833s     110.410ms       0.000us         0.00%       10.470s     130.877ms            80  
                aten::cudnn_convolution_backward_weight         0.02%       4.255ms         0.05%      11.573ms     144.662us        7.258s        25.94%        7.258s      90.729ms            80  
                                             MmBackward         0.00%     797.000us         0.02%       4.101ms     128.156us       0.000us         0.00%        7.212s     225.371ms            32  
                                               aten::mm         0.00%     942.000us         0.01%       2.213ms      46.104us        7.212s        25.77%        7.212s     150.248ms            48  
void cudnn::detail::convolve_wgradNd_engine<3, 256, ...         0.00%       0.000us         0.00%       0.000us       0.000us        7.166s        25.61%        7.166s     149.290ms            48  
                                   volta_dgemm_64x64_nt         0.00%       0.000us         0.00%       0.000us       0.000us        4.224s        15.09%        4.224s     131.988ms            32  
void cudnn::detail::implicit_convolveNd_dgemm<3, 256...         0.00%       0.000us         0.00%       0.000us       0.000us        4.107s        14.68%        4.107s      85.566ms            48  
                 aten::cudnn_convolution_backward_input         0.49%     115.315ms        37.43%        8.819s     110.234ms        3.212s        11.48%        3.212s      40.148ms            80  
                                   volta_dgemm_64x64_tn         0.00%       0.000us         0.00%       0.000us       0.000us        3.052s        10.91%        3.052s     169.567ms            18  
                                  volta_dgemm_128x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us        2.799s        10.00%        2.799s     199.932ms            14  
                                   volta_zgemm_32x32_tn         0.00%       0.000us         0.00%       0.000us       0.000us        1.331s         4.76%        1.331s     380.226us          3500  
                                      BroadcastBackward         0.00%     477.000us         0.07%      17.086ms       2.136ms       0.000us         0.00%        1.263s     157.828ms             8  
                                     ReduceAddCoalesced         0.03%       7.598ms         0.07%      16.609ms       2.076ms        1.263s         4.51%        1.263s     157.828ms             8  
               ncclReduceRingLLKernel_sum_f64(ncclColl)         0.00%       0.000us         0.00%       0.000us       0.000us        1.263s         4.51%        1.263s      19.727ms            64  
                                  volta_dgemm_128x64_tn         0.00%       0.000us         0.00%       0.000us       0.000us        1.208s         4.32%        1.208s      86.271ms            14  
void cudnn::detail::dgrad2d_alg1_1<double, 0, 7, 5, ...         0.00%       0.000us         0.00%       0.000us       0.000us     573.974ms         2.05%     573.974ms      35.873ms            16  
                                        model_inference        19.93%        4.695s        62.19%       14.653s       14.653s      55.134ms         0.20%     495.585ms     495.585ms             1  
void fft3d_r2c_16x16x16<double, double, double2>(dou...         0.00%       0.000us         0.00%       0.000us       0.000us     323.758ms         1.16%     323.758ms      61.179us          5292  
void convolveNd_dgrad_engine<3, 256, false, true, 6,...         0.00%       0.000us         0.00%       0.000us       0.000us     318.401ms         1.14%     318.401ms      53.067ms             6  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 23.562s
Self CUDA time total: 27.984s

Workstation 2, 3 GPUs:
Use Cuda GPU!
Let's use 3 GPUs!
Epoch: 0

0.0021289571904871563
0.0016883809406143598
0.001742074475482782
0.0008426444270598645
0.0002728257843646463
0.0001246445754211182
0.0002026632660951209
0.0003172990978782591
0.326939 minutes passed!

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               CudnnConvolutionBackward         0.03%       4.928ms         0.58%     110.048ms     917.067us       0.000us         0.00%       19.494s     162.454ms           120  
                       aten::cudnn_convolution_backward         0.09%      16.271ms         0.55%     105.120ms     876.000us       0.000us         0.00%       19.494s     162.454ms           120  
                 aten::cudnn_convolution_backward_input         0.12%      22.073ms         0.25%      47.649ms     397.075us       11.774s        25.98%       11.774s      98.115ms           120  
void convolveNd_dgrad_engine<3, 256, false, true, 6,...         0.00%       0.000us         0.00%       0.000us       0.000us       10.914s        24.08%       10.914s     151.580ms            72  
                aten::cudnn_convolution_backward_weight         0.09%      16.677ms         0.22%      41.200ms     343.333us        7.721s        17.03%        7.721s      64.339ms           120  
void cudnn::detail::convolve_wgradNd_engine<3, 256, ...         0.00%       0.000us         0.00%       0.000us       0.000us        7.599s        16.76%        7.599s     105.537ms            72  
                                             MmBackward         0.02%       3.545ms         0.09%      17.972ms     374.417us       0.000us         0.00%        6.993s     145.691ms            48  
                                               aten::mm         0.02%       4.165ms         0.05%       9.172ms     127.389us        6.993s        15.43%        6.993s      97.127ms            72  
                                        model_inference         8.43%        1.596s        98.49%       18.657s       18.657s     430.758ms         0.95%        6.187s        6.187s             1  
                                   DataParallel.forward         0.31%      58.797ms        14.21%        2.692s     336.507ms       0.000us         0.00%        5.544s     692.943ms             8  
                                              Broadcast         0.02%       3.396ms        13.70%        2.596s     324.517ms        5.428s        11.97%        5.430s     678.748ms             8  
            ncclBroadcastRingLLKernel_copy_i8(ncclColl)         0.00%       0.000us         0.00%       0.000us       0.000us        5.428s        11.97%        5.428s      75.383ms            72  
                                      BroadcastBackward         0.00%     942.000us         0.19%      36.168ms       4.521ms       0.000us         0.00%        4.407s     550.897ms             8  
                                     ReduceAddCoalesced         0.08%      15.068ms         0.19%      35.226ms       4.403ms        4.407s         9.72%        4.407s     550.897ms             8  
               ncclReduceRingLLKernel_sum_f64(ncclColl)         0.00%       0.000us         0.00%       0.000us       0.000us        4.407s         9.72%        4.407s      45.907ms            96  
                                   volta_dgemm_64x64_nt         0.00%       0.000us         0.00%       0.000us       0.000us        4.143s         9.14%        4.143s      86.311ms            48  
                                   volta_dgemm_64x64_tn         0.00%       0.000us         0.00%       0.000us       0.000us        4.063s         8.96%        4.063s      84.650ms            48  
void cudnn::detail::implicit_convolveNd_dgemm<3, 256...         0.00%       0.000us         0.00%       0.000us       0.000us        3.999s         8.82%        3.999s      55.543ms            72  
                                   volta_dgemm_64x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us        2.850s         6.29%        2.850s     118.760ms            24  
void cudnn::detail::dgrad2d_alg1_1<double, 0, 7, 5, ...         0.00%       0.000us         0.00%       0.000us       0.000us     845.037ms         1.86%     845.037ms      35.210ms            24  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 18.943s
Self CUDA time total: 45.325s

Epoch: 1

0.0003756547194822839
0.0003834348244248574
0.0003493833935937891
0.00030744885684325297
0.0002622691514377395
0.0002269612708429492
0.00019472064172520355
0.00017378384075557809
0.259148 minutes passed!

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               CudnnConvolutionBackward         0.03%       4.605ms         0.80%     118.913ms     990.942us       0.000us         0.00%       19.734s     164.449ms           120  
                       aten::cudnn_convolution_backward         0.10%      15.166ms         0.77%     114.308ms     952.567us       0.000us         0.00%       19.734s     164.449ms           120  
                 aten::cudnn_convolution_backward_input         0.18%      26.837ms         0.37%      55.256ms     460.467us       11.934s        29.56%       11.934s      99.454ms           120  
void convolveNd_dgrad_engine<3, 256, false, true, 6,...         0.00%       0.000us         0.00%       0.000us       0.000us       11.071s        27.42%       11.071s     153.758ms            72  
                aten::cudnn_convolution_backward_weight         0.12%      17.281ms         0.29%      43.886ms     365.717us        7.799s        19.31%        7.799s      64.995ms           120  
void cudnn::detail::convolve_wgradNd_engine<3, 256, ...         0.00%       0.000us         0.00%       0.000us       0.000us        7.673s        19.00%        7.673s     106.576ms            72  
                                             MmBackward         0.03%       3.972ms         0.13%      19.295ms     401.979us       0.000us         0.00%        7.030s     146.458ms            48  
                                               aten::mm         0.03%       4.422ms         0.07%       9.726ms     135.083us        7.030s        17.41%        7.030s      97.639ms            72  
                                      BroadcastBackward         0.02%       2.257ms         0.19%      27.849ms       3.481ms       0.000us         0.00%        4.581s     572.597ms             8  
                                     ReduceAddCoalesced         0.08%      11.266ms         0.17%      25.592ms       3.199ms        4.581s        11.34%        4.581s     572.597ms             8  
               ncclReduceRingLLKernel_sum_f64(ncclColl)         0.00%       0.000us         0.00%       0.000us       0.000us        4.581s        11.34%        4.581s      47.715ms            96  
                                   volta_dgemm_64x64_tn         0.00%       0.000us         0.00%       0.000us       0.000us        4.189s        10.37%        4.189s      87.264ms            48  
                                   volta_dgemm_64x64_nt         0.00%       0.000us         0.00%       0.000us       0.000us        4.167s        10.32%        4.167s      86.807ms            48  
void cudnn::detail::implicit_convolveNd_dgemm<3, 256...         0.00%       0.000us         0.00%       0.000us       0.000us        4.066s        10.07%        4.066s      56.475ms            72  
                                   volta_dgemm_64x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us        2.863s         7.09%        2.863s     119.302ms            24  
void cudnn::detail::dgrad2d_alg1_1<double, 0, 7, 5, ...         0.00%       0.000us         0.00%       0.000us       0.000us     850.088ms         2.11%     850.088ms      35.420ms            24  
                                        model_inference         2.50%     371.894ms        98.07%       14.606s       14.606s      22.628ms         0.06%     598.763ms     598.763ms             1  
                                   DataParallel.forward         0.36%      54.157ms         0.53%      79.139ms       9.892ms       0.000us         0.00%     361.658ms      45.207ms             8  
                                            aten::copy_         0.02%       3.114ms         1.32%     197.140ms       2.738ms     288.328ms         0.71%     288.328ms       4.005ms            72  
                                              Broadcast         0.02%       2.992ms         0.10%      14.967ms       1.871ms     267.952ms         0.66%     270.388ms      33.798ms             8  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 14.894s
Self CUDA time total: 40.380s

The statistics for the problematic Workstation 1, with three A5000 GPUs, are as follows:

Workstation 1, 1GPU:
Use Cuda GPU!
Let's use 1 GPUs!
Epoch: 0

0.0009272070497193106
0.0003752760953701497
0.00031495944450507437
0.0002910588679051464
0.0002672624948959456
0.0002367020626337095
0.00022315971894642215
0.00020790642670759333
1.533455 minutes passed!

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               CudnnConvolutionBackward         0.00%     381.000us         2.22%        2.108s      52.692ms       0.000us         0.00%       54.972s        1.374s            40  
                       aten::cudnn_convolution_backward         0.00%     705.000us         2.22%        2.107s      52.683ms       0.000us         0.00%       54.972s        1.374s            40  
                 aten::cudnn_convolution_backward_input         0.01%       5.448ms         2.21%        2.103s      52.568ms       50.136s        58.59%       50.136s        1.253s            40  
void cudnn::detail::dgrad_alg1_engine<512, 6, 5, 3, ...         0.00%       0.000us         0.00%       0.000us       0.000us       44.616s        52.14%       44.616s       44.616s             1  
                                        model_inference         5.70%        5.414s        95.87%       91.039s       91.039s     464.128ms         0.54%       21.950s       21.950s             1  
                                      aten::convolution         0.00%     335.000us         0.49%     462.566ms      11.564ms       0.000us         0.00%       16.317s     407.926ms            40  
                                     aten::_convolution         0.00%     708.000us         0.49%     462.231ms      11.556ms       0.000us         0.00%       16.317s     407.926ms            40  
                                aten::cudnn_convolution         0.01%       6.173ms         0.48%     459.872ms      11.497ms       16.310s        19.06%       16.310s     407.752ms            40  
                                           aten::conv3d         0.00%     219.000us         0.00%       4.495ms     187.292us       0.000us         0.00%       16.239s     676.625ms            24  
void implicit_convolveNd_dgemm<3, 128, 6, 7, 3, 3, 5...         0.00%       0.000us         0.00%       0.000us       0.000us       15.896s        18.58%       15.896s     993.531ms            16  
                                               aten::mm         0.01%       9.218ms         2.98%        2.834s      70.847ms       13.538s        15.82%       13.538s     338.457ms            40  
                                             MmBackward         0.00%     292.000us         1.90%        1.802s     112.642ms       0.000us         0.00%        8.605s     537.802ms            16  
void convolveNd_dgrad_engine<3, 256, false, true, 6,...         0.00%       0.000us         0.00%       0.000us       0.000us        5.461s         6.38%        5.461s     227.562ms            24  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64...         0.00%       0.000us         0.00%       0.000us       0.000us        5.058s         5.91%        5.058s     316.115ms            16  
                                           aten::linear         0.00%     350.000us         4.21%        3.995s     249.676ms       0.000us         0.00%        4.945s     309.063ms            16  
                                           aten::matmul         0.00%     332.000us         1.09%        1.033s      64.563ms       0.000us         0.00%        4.933s     308.341ms            16  
                aten::cudnn_convolution_backward_weight         0.00%       2.785ms         0.00%       3.894ms      97.350us        4.837s         5.65%        4.837s     120.921ms            40  
void convolve_wgradNd_engine<3, 128, 5, 5, 3, 3, 3, ...         0.00%       0.000us         0.00%       0.000us       0.000us        4.772s         5.58%        4.772s     198.844ms            24  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64...         0.00%       0.000us         0.00%       0.000us       0.000us        3.547s         4.15%        3.547s     443.373ms             8  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_12...         0.00%       0.000us         0.00%       0.000us       0.000us        3.281s         3.83%        3.281s     468.684ms             7  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 94.957s
Self CUDA time total: 85.568s

Epoch: 1

0.00018144507594206833
0.0001743677954530191
0.0001600232324589636
0.00014879743250654224
0.0001384801070815437
0.0001225694448793762
0.00012302444233999212
0.0001217971922400595
1.427343 minutes passed!

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               CudnnConvolutionBackward         0.00%     326.000us         0.01%       7.464ms     186.600us       0.000us         0.00%       54.125s        1.353s            40  
                       aten::cudnn_convolution_backward         0.00%     643.000us         0.01%       7.138ms     178.450us       0.000us         0.00%       54.125s        1.353s            40  
                 aten::cudnn_convolution_backward_input         0.00%       2.242ms         0.00%       3.131ms      78.275us       49.441s        59.84%       49.441s        1.236s            40  
void cudnn::detail::dgrad_alg1_engine<512, 6, 5, 3, ...         0.00%       0.000us         0.00%       0.000us       0.000us       45.787s        55.42%       45.787s       45.787s             1  
                                        model_inference         0.30%     251.531ms        99.98%       85.017s       85.017s      58.326ms         0.07%       19.903s       19.903s             1  
                                      aten::convolution         0.00%     254.000us         0.01%       5.951ms     148.775us       0.000us         0.00%       16.243s     406.067ms            40  
                                     aten::_convolution         0.00%     612.000us         0.01%       5.697ms     142.425us       0.000us         0.00%       16.243s     406.067ms            40  
                                aten::cudnn_convolution         0.00%       2.923ms         0.00%       3.794ms      94.850us       16.236s        19.65%       16.236s     405.899ms            40  
                                           aten::conv3d         0.00%     178.000us         0.00%       3.000ms     125.000us       0.000us         0.00%       16.165s     673.540ms            24  
void implicit_convolveNd_dgemm<3, 128, 6, 7, 3, 3, 5...         0.00%       0.000us         0.00%       0.000us       0.000us       15.819s        19.15%       15.819s     988.664ms            16  
                                               aten::mm         0.00%       1.834ms         0.00%       2.973ms      74.325us       12.097s        14.64%       12.097s     302.414ms            40  
                                             MmBackward         0.00%     269.000us         0.00%       2.053ms     128.312us       0.000us         0.00%        8.552s     534.500ms            16  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64...         0.00%       0.000us         0.00%       0.000us       0.000us        5.003s         6.06%        5.003s     312.702ms            16  
                aten::cudnn_convolution_backward_weight         0.00%       2.285ms         0.00%       3.364ms      84.100us        4.685s         5.67%        4.685s     117.115ms            40  
void convolve_wgradNd_engine<3, 128, 5, 5, 3, 3, 3, ...         0.00%       0.000us         0.00%       0.000us       0.000us        4.621s         5.59%        4.621s     192.548ms            24  
void convolveNd_dgrad_engine<3, 256, false, true, 6,...         0.00%       0.000us         0.00%       0.000us       0.000us        3.597s         4.35%        3.597s     149.855ms            24  
                                           aten::linear         0.00%     250.000us         0.00%       3.172ms     198.250us       0.000us         0.00%        3.556s     222.244ms            16  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64...         0.00%       0.000us         0.00%       0.000us       0.000us        3.549s         4.30%        3.549s     443.595ms             8  
                                           aten::matmul         0.00%     232.000us         0.00%       2.092ms     130.750us       0.000us         0.00%        3.545s     221.534ms            16  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_12...         0.00%       0.000us         0.00%       0.000us       0.000us        3.314s         4.01%        3.314s     473.412ms             7  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 85.034s
Self CUDA time total: 82.620s

Workstation 1, 2 GPUs working in parallel:
Use Cuda GPU!
Let's use 2 GPUs!
Epoch: 0

0.0040544034369015755
0.0008194306166093237
0.0004000130152196939
0.0003876064496647676
0.000310091377404055
0.00019129871352052222
0.00016230525331223535
0.00019453547816230253
6.192227 minutes passed!

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               CudnnConvolutionBackward         0.00%       1.817ms         0.01%      52.025ms     650.312us       0.000us         0.00%      679.478s        8.493s            80  
                       aten::cudnn_convolution_backward         0.00%       6.669ms         0.01%      50.208ms     627.600us       0.000us         0.00%      679.478s        8.493s            80  
                 aten::cudnn_convolution_backward_input         0.00%      12.998ms         0.00%      25.172ms     314.650us      674.194s        93.78%      674.194s        8.427s            80  
void cudnn::detail::dgrad_alg1_engine<512, 6, 5, 3, ...         0.00%       0.000us         0.00%       0.000us       0.000us      621.258s        86.42%      621.258s       44.376s            14  
void cudnn::detail::dgrad_alg1_engine<128, 5, 5, 3, ...         0.00%       0.000us         0.00%       0.000us       0.000us       44.641s         6.21%       44.641s       22.320s             2  
void implicit_convolveNd_dgemm<3, 128, 6, 7, 3, 3, 5...         0.00%       0.000us         0.00%       0.000us       0.000us       15.373s         2.14%       15.373s     480.420ms            32  
                                      BroadcastBackward         0.00%     564.000us         9.14%       46.517s        5.815s       0.000us         0.00%        9.480s        1.185s             8  
                                     ReduceAddCoalesced         9.14%       46.501s         9.14%       46.517s        5.815s        9.479s         1.32%        9.480s        1.185s             8  
               ncclReduceRingLLKernel_sum_f64(ncclColl)         0.00%       0.000us         0.00%       0.000us       0.000us        9.479s         1.32%        9.479s     148.117ms            64  
                                             MmBackward         0.00%       3.753ms         0.00%      15.253ms     476.656us       0.000us         0.00%        8.403s     262.587ms            32  
                                               aten::mm         0.00%       3.920ms         0.00%       8.917ms     185.771us        8.403s         1.17%        8.403s     175.058ms            48  
void convolveNd_dgrad_engine<3, 256, false, true, 6,...         0.00%       0.000us         0.00%       0.000us       0.000us        8.283s         1.15%        8.283s     172.556ms            48  
                aten::cudnn_convolution_backward_weight         0.00%      11.219ms         0.00%      18.367ms     229.588us        5.284s         0.74%        5.284s      66.052ms            80  
void convolve_wgradNd_engine<3, 128, 5, 5, 3, 3, 3, ...         0.00%       0.000us         0.00%       0.000us       0.000us        5.218s         0.73%        5.218s     108.698ms            48  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64...         0.00%       0.000us         0.00%       0.000us       0.000us        4.937s         0.69%        4.937s     154.270ms            32  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64...         0.00%       0.000us         0.00%       0.000us       0.000us        3.466s         0.48%        3.466s     216.633ms            16  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_32...         0.00%       0.000us         0.00%       0.000us       0.000us        3.426s         0.48%        3.426s     214.122ms            16  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64...         0.00%       0.000us         0.00%       0.000us       0.000us        1.420s         0.20%        1.420s      88.773ms            16  
                                        model_inference         9.82%       49.989s        72.65%      369.789s      369.789s     421.474ms         0.06%     850.378ms     850.378ms             1  
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us     421.496ms         0.06%     421.496ms       3.099ms           136  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 509.012s
Self CUDA time total: 718.905s

Epoch: 1

0.00021934110920382598
0.00023363136428638785
0.00021870171749141336
0.00020551554557354
0.00017651321924490616
0.00015204309430480085
0.0001667849194120618
0.00017974811397844192
6.040423 minutes passed!

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               CudnnConvolutionBackward         0.00%       2.109ms         0.01%      50.901ms     636.263us       0.000us         0.00%      682.018s        8.525s            80  
                       aten::cudnn_convolution_backward         0.00%       5.480ms         0.01%      48.792ms     609.900us       0.000us         0.00%      682.018s        8.525s            80  
                 aten::cudnn_convolution_backward_input         0.00%      14.127ms         0.01%      22.035ms     275.438us      676.651s        93.99%      676.651s        8.458s            80  
void cudnn::detail::dgrad_alg1_engine<512, 6, 5, 3, ...         0.00%       0.000us         0.00%       0.000us       0.000us      623.678s        86.63%      623.678s       44.548s            14  
void cudnn::detail::dgrad_alg1_engine<128, 5, 5, 3, ...         0.00%       0.000us         0.00%       0.000us       0.000us       44.453s         6.17%       44.453s       22.227s             2  
void implicit_convolveNd_dgemm<3, 128, 6, 7, 3, 3, 5...         0.00%       0.000us         0.00%       0.000us       0.000us       15.740s         2.19%       15.740s     491.883ms            32  
void convolveNd_dgrad_engine<3, 256, false, true, 6,...         0.00%       0.000us         0.00%       0.000us       0.000us        8.507s         1.18%        8.507s     177.226ms            48  
                                             MmBackward         0.00%       1.960ms         0.00%      14.646ms     457.688us       0.000us         0.00%        8.357s     261.151ms            32  
                                               aten::mm         0.00%       4.850ms         0.00%       9.045ms     188.438us        8.357s         1.16%        8.357s     174.101ms            48  
                                      BroadcastBackward         0.00%     536.000us         0.01%      30.782ms       3.848ms       0.000us         0.00%        7.875s     984.383ms             8  
                                     ReduceAddCoalesced         0.00%      15.328ms         0.01%      30.246ms       3.781ms        7.875s         1.09%        7.875s     984.383ms             8  
               ncclReduceRingLLKernel_sum_f64(ncclColl)         0.00%       0.000us         0.00%       0.000us       0.000us        7.875s         1.09%        7.875s     123.047ms            64  
                aten::cudnn_convolution_backward_weight         0.00%      12.964ms         0.01%      21.277ms     265.962us        5.367s         0.75%        5.367s      67.090ms            80  
void convolve_wgradNd_engine<3, 128, 5, 5, 3, 3, 3, ...         0.00%       0.000us         0.00%       0.000us       0.000us        5.300s         0.74%        5.300s     110.426ms            48  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64...         0.00%       0.000us         0.00%       0.000us       0.000us        4.900s         0.68%        4.900s     153.116ms            32  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_32...         0.00%       0.000us         0.00%       0.000us       0.000us        3.533s         0.49%        3.533s     220.821ms            16  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64...         0.00%       0.000us         0.00%       0.000us       0.000us        3.457s         0.48%        3.457s     216.070ms            16  
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64...         0.00%       0.000us         0.00%       0.000us       0.000us        1.438s         0.20%        1.438s      89.852ms            16  
                                        model_inference         0.11%     398.925ms        99.96%      361.230s      361.230s      56.032ms         0.01%     494.470ms     494.470ms             1  
                                            aten::copy_         0.00%       2.888ms         0.06%     233.377ms       3.241ms     416.612ms         0.06%     416.612ms       5.786ms            72  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 361.389s
Self CUDA time total: 719.940s

Workstation 1 is currently in use, so I don’t want to stop it to change the cuDNN version. Do these statistics mean that I need to try other cuDNN versions (it is currently cuDNN 8.2.1 for CUDA 11.3)? Or do they indicate some other problem?

Thanks!

You have almost an extra minute of CPU load on the A5000 workstation for a single-card run.

@my3bikaht Thanks. I already use data.to(device) and model.to(device) to move them onto the CUDA GPU, where data is the training set loaded by the DataLoader and model is the network. But I don’t know what might be causing the extra load on the CPU. Can you explain the solution to the problem further?

PS: the model_inference entry is the extra time caused by turning on the PyTorch profiler log.
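
For readers who want to produce tables like the ones above, here is a minimal sketch using torch.profiler with a record_function("model_inference") label. It is an assumption that the tables in this thread were generated this way; the single Conv3d layer and the 16x16x16 batch are hypothetical stand-ins for the real network and data.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the real network and one training batch.
net = nn.Conv3d(1, 4, kernel_size=3, padding=1).to(device)
batch = torch.randn(2, 1, 16, 16, 16, device=device)

# Always record CPU activity; record CUDA kernels only when a GPU is present.
activities = [ProfilerActivity.CPU]
sort_key = "cpu_time_total"
if device.type == "cuda":
    activities.append(ProfilerActivity.CUDA)
    sort_key = "cuda_time_total"

with profile(activities=activities) as prof:
    # Everything inside this block is grouped under the "model_inference" row.
    with record_function("model_inference"):
        loss = net(batch).sum()
        loss.backward()

# Aggregate events by name and print the most expensive ones first.
table = prof.key_averages().table(sort_by=sort_key, row_limit=20)
print(table)
```

Because the model_inference row wraps the whole step, its CPU total includes everything else in the table, so it should not be read as a separate cost on top of the other rows.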

Yup, I missed that CUDA time. What if you enable optimal cuDNN algorithm selection by setting torch.backends.cudnn.benchmark = True?
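
A minimal sketch of where that flag goes, assuming it is set once near the top of the training script before the first forward pass. The Conv3d layer and the batch shape are hypothetical placeholders, not the network from this thread.

```python
import torch
import torch.nn as nn

# Let cuDNN time the available convolution algorithms the first time it
# sees a given input shape and reuse the fastest one afterwards.
# The flag has no effect when running on the CPU.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the real network.
net = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3, padding=1).to(device)

# Autotuning pays off most when the input shape stays fixed between
# iterations, since each new shape triggers a fresh search.
batch = torch.randn(2, 1, 16, 16, 16, device=device)

out = net(batch)
out.sum().backward()
print(out.shape)
```

The first iteration can be slower while cuDNN runs its search, so timings are best compared from the second epoch onward.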

@my3bikaht Thanks for your solution, training is much faster now. Thanks a lot!
It is great that I learn something new every time I search the forum.