Possible bug in P2P (device-to-device) copy overlapped with work on GPU

yurib · December 25, 2018, 3:32pm

The following very simple code prints 1 (tensor(1., device='cuda:1')) instead of the 0 I'd expect to see.
Using the default stream instead of a custom one 'fixes' the issue, but defeats the purpose of the original code, which is to copy from one GPU to several others at the same time (asynchronously). I would be happy to hear of any other solutions that achieve this goal.

I can only guess that something odd happens with memory when "x1 + 1" starts executing and the device-to-device copy starts at the same time (see the nvprof output).

PyTorch 1.0, CUDA 10

12.9429s 4.6388ms (878907 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi512ELi1EZNS0_16gpu_unary_kernelIZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS6_RKT_EUlfE0_EEvS6_SC_EUliE_EEviT1 [886]
12.9468s 74.185ms - - - - - 3.3528GB 45.195GB/s Device Device Tesla V100-SXM2 1 28 Tesla V100-SXM2 1 Tesla V100-SXM2 2 [CUDA memcpy PtoP]
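To put a number on "at the same time", the overlap between these two rows can be worked out from their Start and Duration columns. This is only a sketch: `overlap_seconds` is a hypothetical helper, and the Start values are rounded by nvprof to four decimals, so the result is approximate.

```python
def overlap_seconds(start_a, dur_a, start_b, dur_b):
    """Return how long (in seconds) two intervals overlap, or 0.0 if they do not."""
    end_a = start_a + dur_a
    end_b = start_b + dur_b
    return max(0.0, min(end_a, end_b) - max(start_a, start_b))

# (start, duration) in seconds, copied from the two nvprof rows above.
add_kernel = (12.9429, 4.6388e-3)   # elementwise add kernel, stream 18
p2p_copy = (12.9468, 74.185e-3)     # [CUDA memcpy PtoP], stream 28

overlap = overlap_seconds(*add_kernel, *p2p_copy)
print(f"overlap: {overlap * 1e3:.2f} ms")
```

With these values the add kernel and the peer-to-peer copy run concurrently for roughly 0.74 ms, on two different streams.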

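import torch

n = int(3e4)

# Source tensor on GPU 0, filled with zeros.
d0 = torch.device('cuda:0')
x0 = torch.zeros(n, n, device=d0)

# Queue some work on GPU 1's default stream; the results are discarded.
d1 = torch.device('cuda:1')
x1 = torch.zeros(n, n, device=d1)
x1 + 1
x1 + 1

# Device-to-device copy issued on a custom (non-default) stream.
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    x2 = x0.to(d1)

torch.cuda.synchronize()  # syncing on dev1 doesn't help; using events doesn't help either

print(x2[0, 0])  # prints 1, expected 0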
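One way to make the ordering explicit is sketched below. It assumes (this is not confirmed anywhere in this thread) that the wrong value comes from the side-stream copy racing with work still queued on the default streams, so each side stream first waits on the current streams of both devices. `copy_ordered` and its parameter names are hypothetical, made up for illustration; the PyTorch calls used are `torch.cuda.Stream`, `torch.cuda.current_stream`, `Stream.wait_stream` and `torch.cuda.stream`. Whether copies to several destinations actually overlap depends on the PyTorch version and on how it schedules peer-to-peer copies.

```python
import torch

def copy_ordered(src, dst_devices):
    """Copy `src` to every device in `dst_devices`, using one side stream per
    CUDA destination and ordering each side stream after already-queued work."""
    copies, pending = [], []
    for dst in dst_devices:
        dst = torch.device(dst)
        if dst.type != "cuda":
            # Non-CUDA destination: a plain copy, no streams involved.
            copies.append(src.to(dst))
            continue
        side = torch.cuda.Stream(device=dst)
        # Assumption: the copy must not start before work already queued on
        # the destination's (and the source's) current stream has finished.
        side.wait_stream(torch.cuda.current_stream(dst))
        if src.is_cuda:
            side.wait_stream(torch.cuda.current_stream(src.device))
        with torch.cuda.stream(side):
            copies.append(src.to(dst, non_blocking=True))
        pending.append((dst, side))
    # Later work on each destination's current stream waits for its copy.
    for dst, side in pending:
        torch.cuda.current_stream(dst).wait_stream(side)
    return copies

# Only runs on a machine with at least two GPUs.
if torch.cuda.device_count() >= 2:
    x0 = torch.zeros(1024, 1024, device="cuda:0")
    x1 = torch.zeros(1024, 1024, device="cuda:1")
    x1 + 1
    (x2,) = copy_ordered(x0, ["cuda:1"])
    print(x2[0, 0])
```

The final loop is there so that code running afterwards on the default streams can use the copies without an explicit `torch.cuda.synchronize()`.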

nvprof

Start  Duration  Grid Size  Block Size  Regs*  SSMem*  DSMem*  Size  Throughput  SrcMemType  DstMemType  Device  Context  Stream  Src Dev  Src Ctx  Dst Dev  Dst Ctx  Name
6.69668s 2.0160us - - - - - 3.3528GB 2e+06GB/s Device - Tesla V100-SXM2 1 7 - - - - [CUDA memset]
13.0215s 1.9520us - - - - - 3.3528GB 2e+06GB/s Device - Tesla V100-SXM2 2 18 - - - - [CUDA memset]
13.0261s 4.7312ms (878907 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi512ELi1EZNS0_16gpu_unary_kernelIZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS6_RKT_EUlfE0_EEvS6_SC_EUliE_EEviT1 [886]
13.0292s 74.185ms - - - - - 3.3528GB 45.194GB/s Device Device Tesla V100-SXM2 1 28 Tesla V100-SXM2 1 Tesla V100-SXM2 2 [CUDA memcpy PtoP]
13.0308s 5.0902ms (878907 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi512ELi1EZNS0_16gpu_unary_kernelIZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS6_RKT_EUlfE0_EEvS6_SC_EUliE_EEviT1 [889]
13.0359s 5.0761ms (878907 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi512ELi1EZNS0_16gpu_unary_kernelIZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS6_RKT_EUlfE0_EEvS6_SC_EUliE_EEviT1 [902]
13.0410s 5.0793ms (878907 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi512ELi1EZNS0_16gpu_unary_kernelIZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS6_RKT_EUlfE0_EEvS6_SC_EUliE_EEviT1 [905]
13.1045s 2.6560us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply3<TensorEQOp<float, unsigned char>, unsigned char, float, float, unsigned int, int=1, int=1, int=1>(OffsetInfo<unsigned char, float, float>, OffsetInfo<TensorEQOp<float, unsigned char>, float, unsigned int>, OffsetInfo<unsigned char, float, int=1>, float, float) [1014]
13.1046s 1.5040us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<Tensor_abs_Float_Op, float, float, unsigned int, int=1, int=1>(OffsetInfo<float, unsigned int, int=1>, OffsetInfo<float, unsigned int, int=1>, unsigned int, Tensor_abs_Float_Op) [1028]
13.1047s 1.7280us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<TensorNEValueOp<float, unsigned char>, unsigned char, float, unsigned int, int=1, int=1>(OffsetInfo<unsigned char, unsigned char, float>, OffsetInfo<TensorNEValueOp<float, unsigned char>, unsigned char, unsigned int>, unsigned char, float) [1040]
13.1048s 1.6630us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply3<TensorBitAndOp, unsigned char, unsigned char, unsigned char, unsigned int, int=1, int=1, int=1>(OffsetInfo<TensorBitAndOp, unsigned char, unsigned int>, OffsetInfo<unsigned char, unsigned char, int=1>, OffsetInfo<unsigned char, unsigned char, int=1>, unsigned char, unsigned char) [1051]
13.1048s 1.1840us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<TensorNEValueOp<float, unsigned char>, unsigned char, float, unsigned int, int=1, int=1>(OffsetInfo<unsigned char, unsigned char, float>, OffsetInfo<TensorNEValueOp<float, unsigned char>, unsigned char, unsigned int>, unsigned char, float) [1062]
13.1048s 1.2160us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply3<TensorBitAndOp, unsigned char, unsigned char, unsigned char, unsigned int, int=1, int=1, int=1>(OffsetInfo<TensorBitAndOp, unsigned char, unsigned int>, OffsetInfo<unsigned char, unsigned char, int=1>, OffsetInfo<unsigned char, unsigned char, int=1>, unsigned char, unsigned char) [1073]
13.1049s 3.6480us (1 1 1) (1024 1 1) 30 0B 8.0000KB - - - - Tesla V100-SXM2 2 18 - - - - void kernelReduceAll<unsigned char, unsigned int, long, thrust::identity, ReduceAdd, int=1>(TensorInfo<unsigned char, unsigned int>, unsigned int, long, long, thrust::identity, long*) [1086]
13.1049s 3.2000us - - - - - 8B 2.3842MB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1051s 1.9520us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at4cuda75_GLOBAL__N__51_tmpxft_000055a3_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a321kernelPointwiseApply2IZN75_GLOBAL__N__51_tmpxft_000055a3_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a36CopyOpIlhE5applyERNS_6TensorERKS6_EUlRlRKhE_lhjLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSF_IT1_SH_EESH_T [1101]
13.1051s 1.3440us (1 1 1) (128 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__scan::InitAgent<thrust::cuda_cub::cub::ScanTileState<long, bool=1>, int>, thrust::cuda_cub::cub::ScanTileState<long, bool=1>, int>(bool=1, thrust::cuda_cub::cub::ScanTileState<long, bool=1>) [1121]
13.1051s 7.0080us (1 1 1) (96 1 1) 42 0B 2.0781KB - - - - Tesla V100-SXM2 2 18 - - - - void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__scan::ScanAgent<thrust::device_ptr, thrust::device_ptr, thrust::plus, int, long, thrust::detail::integral_constant<bool, bool=0>>, thrust::device_ptr, thrust::device_ptr, thrust::plus, int, thrust::cuda_cub::cub::ScanTileState<long, bool=1>, thrust::cuda_cub::__scan::AddInitToExclusiveScan<long, thrust::plus>>(thrust::device_ptr, thrust::device_ptr, long, thrust::plus, int, long) [1126]
13.1052s 1.7600us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply3<TensorMaskedSelectOp<float, unsigned char, long>, unsigned char, long, float, unsigned int, int=1, int=1, int=1>(OffsetInfo<unsigned char, unsigned char, long>, OffsetInfo<long, unsigned char, float>, OffsetInfo<TensorMaskedSelectOp<float, unsigned char, long>, unsigned char, unsigned int>, unsigned char, float) [1132]
13.1052s 1.2480us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<Tensor_abs_Float_Op, float, float, unsigned int, int=1, int=1>(OffsetInfo<float, unsigned int, int=1>, OffsetInfo<float, unsigned int, int=1>, unsigned int, Tensor_abs_Float_Op) [1146]
13.1053s 1.8240us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at4cuda75_GLOBAL__N__51_tmpxft_000055a3_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a321kernelPointwiseApply2IZN75_GLOBAL__N__51_tmpxft_000055a3_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a36CopyOpIdfE5applyERNS_6TensorERKS6_EUlRdRKfE_dfjLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSF_IT1_SH_EESH_T [1159]
13.1054s 5.6320us (1 1 1) (1024 1 1) 16 0B 8.0000KB - - - - Tesla V100-SXM2 2 18 - - - - void kernelReduceAll<double, unsigned int, double, thrust::identity, ReduceMin, int=1>(TensorInfo<double, unsigned int>, unsigned int, double, double, thrust::identity, double*) [1172]
13.1054s 1.8880us - - - - - 8B 4.0410MB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1055s 1.3440us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply1<TensorFillOp, double, unsigned int, int=1>(OffsetInfo<TensorFillOp, double, unsigned int>, double, double) [1184]
13.1055s 5.4080us (1 1 1) (1024 1 1) 16 0B 8.0000KB - - - - Tesla V100-SXM2 2 18 - - - - void kernelReduceAll<double, unsigned int, double, thrust::identity, ReduceMax, int=1>(TensorInfo<double, unsigned int>, unsigned int, double, double, thrust::identity, double*) [1196]
13.1055s 1.5040us - - - - - 8B 5.0727MB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1056s 1.1840us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply1<TensorFillOp, double, unsigned int, int=1>(OffsetInfo<TensorFillOp, double, unsigned int>, double, double) [1208]
13.1057s 1.5680us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<Tensor_ceil_Float_Op, float, float, unsigned int, int=1, int=1>(OffsetInfo<float, unsigned int, int=1>, OffsetInfo<float, unsigned int, int=1>, unsigned int, Tensor_ceil_Float_Op) [1226]
13.1057s 1.5680us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply3<TensorNEOp<float, unsigned char>, unsigned char, float, float, unsigned int, int=1, int=1, int=1>(OffsetInfo<unsigned char, float, float>, OffsetInfo<TensorNEOp<float, unsigned char>, float, unsigned int>, OffsetInfo<unsigned char, float, int=1>, float, float) [1238]
13.1058s 1.6000us - - - - - 1B 610.35KB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1059s 2.3680us (1 1 1) (128 1 1) 26 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi128ELi4EZNS0_17gpu_binary_kernelIZNS0_15div_kernel_implIdEEvRNS_14TensorIteratorEEUlddE_EEvS5_RKT_EUliE0_EEviT1 [1259]
13.1060s 1.5360us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<TensorGTValueOp<double, unsigned char>, unsigned char, double, unsigned int, int=1, int=1>(OffsetInfo<unsigned char, unsigned char, double>, OffsetInfo<TensorGTValueOp<double, unsigned char>, unsigned char, unsigned int>, unsigned char, double) [1270]
13.1060s 1.4720us - - - - - 1B 663.42KB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1060s 1.2480us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<TensorGTValueOp<double, unsigned char>, unsigned char, double, unsigned int, int=1, int=1>(OffsetInfo<unsigned char, unsigned char, double>, OffsetInfo<TensorGTValueOp<double, unsigned char>, unsigned char, unsigned int>, unsigned char, double) [1289]
13.1060s 1.4720us - - - - - 1B 663.42KB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1061s 1.4720us - - - - - 4B 2.5915MB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1061s 1.9520us - - - - - 4B 1.9543MB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]

smth · December 29, 2018, 7:21am

I’ve answered on the issue https://github.com/pytorch/pytorch/issues/15568