In my code I used torch.nn.DataParallel to speed up training, and it is true that two GPUs are used (the GPUs are TITAN Xs). The time taken by forward, backward, and optimizer.step() on one and two GPUs is as follows (the columns show the time per epoch and the forward, backward, and parameter-update time of one batch):
As you can see, the time per epoch differs across batch sizes, and as the batch size increases the epoch time decreases. The weird thing, however, is that the forward time stays the same for different batch sizes.
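For context, here is a minimal sketch of how such per-phase timings can be collected (the model, optimizer, and data below are placeholders, not my actual code). Because CUDA kernels execute asynchronously, torch.cuda.synchronize() is called before each clock read so the intervals reflect GPU work that has actually finished:

```python
import time
import torch
import torch.nn as nn

# Placeholder model and data; the real network and loader would go here.
model = nn.DataParallel(nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                                      nn.Linear(512, 10))).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def timed_step(inputs, targets):
    """Return (forward, backward, step) wall-clock times for one batch."""
    torch.cuda.synchronize()          # flush pending GPU work before timing
    t0 = time.time()
    outputs = model(inputs)           # forward
    loss = criterion(outputs, targets)
    torch.cuda.synchronize()          # wait until the forward really finishes
    t1 = time.time()
    optimizer.zero_grad()
    loss.backward()                   # backward
    torch.cuda.synchronize()
    t2 = time.time()
    optimizer.step()                  # parameter update
    torch.cuda.synchronize()
    t3 = time.time()
    return t1 - t0, t2 - t1, t3 - t2

inputs = torch.randn(256, 512).cuda()          # example batch size of 256
targets = torch.randint(0, 10, (256,)).cuda()
fwd, bwd, step = timed_step(inputs, targets)
print(f"forward {fwd:.4f}s  backward {bwd:.4f}s  step {step:.4f}s")
```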
Then I used nvprof to inspect the CUDA runtime activity. Snippets of the single- and double-GPU logs are presented below:
Single GPU
.....
0.00% 6.4000us 1 6.4000us 6.4000us 6.4000us void cudnn::winograd_nonfused::winogradForwardFilter4x4<float, float>(cudnn::winograd_nonfused::WinogradFilterParams<float, float>)
API calls: 71.29% 124.052s 7147 17.357ms 8.1660us 13.6981s cudaMalloc
21.03% 36.5876s 3595 10.177ms 1.4340us 25.420ms cudaFree
3.93% 6.83755s 540248 12.656us 3.7030us 25.729ms cudaLaunchKernel
1.10% 1.91393s 95923 19.952us 1.6960us 1.24719s cudaMemsetAsync
0.64% 1.11508s 60439 18.449us 2.6610us 35.854ms cudaMemcpyAsync
0.64% 1.10983s 4094479 271ns 174ns 881.98us cudaGetDevice
0.56% 970.50ms 53 18.311ms 10.491ms 26.820ms cudaMemGetInfo
0.30% 519.20ms 1443322 359ns 210ns 859.60us cudaSetDevice
0.10% 178.46ms 2 89.228ms 86.994ms 91.463ms cudaGetDeviceProperties
0.07% 127.21ms 262 485.55us 9.6080us 19.629ms cudaEventSynchronize
0.07% 118.39ms 46675 2.5360us 962ns 256.32us cudaBindTexture
0.06% 100.14ms 673242 148ns 70ns 776.79us cudaGetLastError
....
Double GPU
....
0.00% 4.7360us 4 1.1840us 1.0240us 1.3440us _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00005a38_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a321kernelPointwiseApply2IZN75_GLOBAL__N__51_tmpxft_00005a38_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a36CopyOpIfhE5applyERNS_6TensorERKS6_EUlRfRKhE_fhjLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSF_IT1_SH_EESH_T_
0.00% 4.7040us 4 1.1760us 896ns 1.4400us void kernelPointwiseApply3<TensorAddCMulOp<float>, float, float, float, unsigned int, int=1, int=1, int=1>(OffsetInfo<TensorAddCMulOp<float>, float, unsigned int>, OffsetInfo<float, float, int=1>, OffsetInfo<float, float, int=1>, float, float)
0.00% 3.4240us 2 1.7120us 1.6320us 1.7920us void kernelPointwiseApply3<TensorEQOp<long, unsigned char>, unsigned char, long, long, unsigned int, int=1, int=2, int=2>(OffsetInfo<unsigned char, long, long>, OffsetInfo<TensorEQOp<long, unsigned char>, long, unsigned int>, OffsetInfo<unsigned char, long, int=1>, long, long)
API calls: 69.26% 214.356s 11803 18.161ms 8.7010us 13.8486s cudaMalloc
21.00% 64.9878s 5936 10.948ms 1.3770us 38.564ms cudaFree
7.20% 22.2947s 2 11.1474s 40.591ms 22.2541s cudaDeviceEnablePeerAccess
0.83% 2.57946s 209063 12.338us 3.7060us 11.642ms cudaLaunchKernel
0.63% 1.93821s 94 20.619ms 5.5289ms 46.548ms cudaMemGetInfo
0.27% 822.61ms 1766882 465ns 173ns 924.95us cudaGetDevice
0.12% 371.45ms 4 92.864ms 82.527ms 98.761ms cudaGetDeviceProperties
0.11% 351.81ms 614794 572ns 188ns 782.52us cudaSetDevice
0.10% 295.81ms 8 36.976ms 29.369ms 41.138ms cudaHostRegister
0.08% 242.25ms 20897 11.592us 2.7710us 20.411ms cudaMemcpyAsync
0.06% 199.40ms 36125 5.5190us 1.8260us 12.870ms cudaMemsetAsync
0.06% 198.11ms 192 1.0318ms 236ns 46.915ms cuDeviceGetAttribute
....
After comparing the two logs, I find that the difference is a line in the double-GPU log that does not appear in the single-GPU version:
7.20% 22.2947s 2 11.1474s 40.591ms 22.2541s cudaDeviceEnablePeerAccess
So how can I solve this problem?
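One detail worth noting: cudaDeviceEnablePeerAccess is called only twice in the whole run, so it looks like a one-time per-process initialization cost rather than per-batch overhead. Assuming that is what inflates the first measurements, one way to keep such setup out of the timed numbers is to run an untimed warm-up batch first. A minimal sketch, reusing the placeholder names from the timing snippet above:

```python
# Warm-up: run one untimed batch so one-time CUDA setup costs
# (context creation, peer-access enabling, cudnn workspace allocation)
# are paid before measurement begins.
out = model(inputs)
criterion(out, targets).backward()
optimizer.step()
optimizer.zero_grad()
torch.cuda.synchronize()

# Only after the warm-up, time the phases of interest.
fwd, bwd, step = timed_step(inputs, targets)
print(f"forward {fwd:.4f}s  backward {bwd:.4f}s  step {step:.4f}s")
```

This only excludes the initialization cost from the measurements; if the two-GPU run is also slower in steady state, the warm-up alone would not explain it.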