Why do two GPUs run slower than one GPU?

In my code I use torch.nn.DataParallel to speed up training, and two GPUs are indeed being used (the GPUs are Titan X). But the times for forward, backward, and optimizer.step() with one and two GPUs are as follows (the columns are the time per epoch and the per-batch forward, backward, and parameter-update times):

[image: table of per-epoch and per-batch forward/backward/update times for one and two GPUs at several batch sizes]

As you can see, the time per epoch differs across batch sizes and decreases as the batch size grows. The weird thing, however, is that the forward time is the same for every batch size.
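(For reference, the per-phase numbers above are wall-clock timings around forward, backward and optimizer.step(). A minimal sketch of that kind of measurement is below; the model, data and optimizer are placeholders, not my real code. Without the torch.cuda.synchronize() calls, the "forward" number would mostly measure kernel-launch overhead, which might be one reason it barely changes with batch size.)

```python
import time
import torch
import torch.nn as nn

# Placeholder model, data and optimizer -- not the real training code.
model = nn.DataParallel(nn.Linear(1024, 1024)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
x = torch.randn(128, 1024, device="cuda")
y = torch.randn(128, 1024, device="cuda")

torch.cuda.synchronize()              # finish any pending GPU work first
t0 = time.time()
out = model(x)                        # forward
torch.cuda.synchronize()              # wait for the forward kernels to finish
t1 = time.time()
loss = criterion(out, y)
loss.backward()                       # backward
torch.cuda.synchronize()
t2 = time.time()
optimizer.step()                      # parameter update
torch.cuda.synchronize()
t3 = time.time()
print(f"forward {t1 - t0:.4f}s  backward {t2 - t1:.4f}s  step {t3 - t2:.4f}s")
```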

Then I used nvprof to inspect the CUDA API calls. Snippets of the logs for single and double GPU are shown below:

Single GPU

.....
                    0.00%  6.4000us         1  6.4000us  6.4000us  6.4000us  void cudnn::winograd_nonfused::winogradForwardFilter4x4<float, float>(cudnn::winograd_nonfused::WinogradFilterParams<float, float>)
      API calls:   71.29%  124.052s      7147  17.357ms  8.1660us  13.6981s  cudaMalloc
                   21.03%  36.5876s      3595  10.177ms  1.4340us  25.420ms  cudaFree
                    3.93%  6.83755s    540248  12.656us  3.7030us  25.729ms  cudaLaunchKernel
                    1.10%  1.91393s     95923  19.952us  1.6960us  1.24719s  cudaMemsetAsync
                    0.64%  1.11508s     60439  18.449us  2.6610us  35.854ms  cudaMemcpyAsync
                    0.64%  1.10983s   4094479     271ns     174ns  881.98us  cudaGetDevice
                    0.56%  970.50ms        53  18.311ms  10.491ms  26.820ms  cudaMemGetInfo
                    0.30%  519.20ms   1443322     359ns     210ns  859.60us  cudaSetDevice
                    0.10%  178.46ms         2  89.228ms  86.994ms  91.463ms  cudaGetDeviceProperties
                    0.07%  127.21ms       262  485.55us  9.6080us  19.629ms  cudaEventSynchronize
                    0.07%  118.39ms     46675  2.5360us     962ns  256.32us  cudaBindTexture
                    0.06%  100.14ms    673242     148ns      70ns  776.79us  cudaGetLastError
....

Double GPU

....
                   0.00%  4.7360us         4  1.1840us  1.0240us  1.3440us  _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00005a38_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a321kernelPointwiseApply2IZN75_GLOBAL__N__51_tmpxft_00005a38_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a36CopyOpIfhE5applyERNS_6TensorERKS6_EUlRfRKhE_fhjLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSF_IT1_SH_EESH_T_
                    0.00%  4.7040us         4  1.1760us     896ns  1.4400us  void kernelPointwiseApply3<TensorAddCMulOp<float>, float, float, float, unsigned int, int=1, int=1, int=1>(OffsetInfo<TensorAddCMulOp<float>, float, unsigned int>, OffsetInfo<float, float, int=1>, OffsetInfo<float, float, int=1>, float, float)
                    0.00%  3.4240us         2  1.7120us  1.6320us  1.7920us  void kernelPointwiseApply3<TensorEQOp<long, unsigned char>, unsigned char, long, long, unsigned int, int=1, int=2, int=2>(OffsetInfo<unsigned char, long, long>, OffsetInfo<TensorEQOp<long, unsigned char>, long, unsigned int>, OffsetInfo<unsigned char, long, int=1>, long, long)
      API calls:   69.26%  214.356s     11803  18.161ms  8.7010us  13.8486s  cudaMalloc
                   21.00%  64.9878s      5936  10.948ms  1.3770us  38.564ms  cudaFree
                    7.20%  22.2947s         2  11.1474s  40.591ms  22.2541s  cudaDeviceEnablePeerAccess
                    0.83%  2.57946s    209063  12.338us  3.7060us  11.642ms  cudaLaunchKernel
                    0.63%  1.93821s        94  20.619ms  5.5289ms  46.548ms  cudaMemGetInfo
                    0.27%  822.61ms   1766882     465ns     173ns  924.95us  cudaGetDevice
                    0.12%  371.45ms         4  92.864ms  82.527ms  98.761ms  cudaGetDeviceProperties
                    0.11%  351.81ms    614794     572ns     188ns  782.52us  cudaSetDevice
                    0.10%  295.81ms         8  36.976ms  29.369ms  41.138ms  cudaHostRegister
                    0.08%  242.25ms     20897  11.592us  2.7710us  20.411ms  cudaMemcpyAsync
                    0.06%  199.40ms     36125  5.5190us  1.8260us  12.870ms  cudaMemsetAsync
                    0.06%  198.11ms       192  1.0318ms     236ns  46.915ms  cuDeviceGetAttribute
....

After comparing the two logs, the difference I find is that the double-GPU log contains a line that does not appear in the single-GPU version:

7.20% 22.2947s 2 11.1474s 40.591ms 22.2541s cudaDeviceEnablePeerAccess

So how can I solve this problem?
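(In case it helps narrow things down, peer-to-peer support between the two cards can be checked from PyTorch with a small snippet like the one below; the assumption that the two Titan X cards show up as devices 0 and 1 is mine.)

```python
import torch

# Assumption: the two Titan X cards are enumerated as CUDA devices 0 and 1.
n = torch.cuda.device_count()
print("visible GPUs:", n)
for i in range(n):
    for j in range(n):
        if i != j:
            # True if device i can directly access the memory of device j (P2P).
            print(f"peer access {i} -> {j}:", torch.cuda.can_device_access_peer(i, j))
```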

Hello, I ran into the same problem. Did you find the answer?

It looks like your batches are too small. If your batch size is 128, the two-GPU case gets a batch of only 64 per GPU. Also, your profile shows almost all the time being spent in CUDA memory allocation (cudaMalloc).

Try making the batch as large as you can, then see how things behave.
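Something along these lines (a rough sketch with a made-up dataset and model; substitute your own): keep the per-GPU batch fixed and scale the global batch with the number of GPUs, and run a few untimed warm-up iterations so one-time costs such as cudaDeviceEnablePeerAccess and the caching allocator's initial cudaMalloc calls don't land inside your timed region.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model, data and optimizer -- substitute your own.
model = nn.DataParallel(nn.Linear(1024, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Keep the per-GPU batch constant and grow the global batch with the GPU count.
per_gpu_batch = 128
batch_size = per_gpu_batch * torch.cuda.device_count()

dataset = TensorDataset(torch.randn(10000, 1024), torch.randint(0, 10, (10000,)))
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                    num_workers=4, pin_memory=True)

def train_step(x, y):
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# Untimed warm-up iterations absorb one-time costs
# (peer-access setup, caching-allocator cudaMallocs, cuDNN setup).
batches = iter(loader)
for _ in range(3):
    train_step(*next(batches))
torch.cuda.synchronize()

# ... start timing the real epochs from here.
```

After the warm-up, the per-iteration numbers should be much more comparable between the one-GPU and two-GPU runs.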