Possible bug in P2P (device-to-device) copy overlapped with work on GPU

The following very simple code prints 1 (tensor(1., device='cuda:1')) instead of the 0 I'd expect to see.
Using the default stream instead of a custom one 'fixes' the issue, but defeats the purpose of the original code: copying from one GPU to several others at the same time (asynchronously). I'd be happy to hear of any other solution that achieves this goal.

I can only guess that something odd happens with memory when "x1 + 1" starts executing (see the nvprof lines below) and the device-to-device copy starts at the same time.

PyTorch 1.0, CUDA 10

12.9429s 4.6388ms (878907 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi512ELi1EZNS0_16gpu_unary_kernelIZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS6_RKT_EUlfE0_EEvS6_SC_EUliE_EEviT1 [886]
12.9468s 74.185ms - - - - - 3.3528GB 45.195GB/s Device Device Tesla V100-SXM2 1 28 Tesla V100-SXM2 1 Tesla V100-SXM2 2 [CUDA memcpy PtoP]
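
To make the "at the same time" part concrete, here is a small sketch that recomputes the overlap and the transfer size from the two rows above. It assumes the first column is the start time in seconds and the second the duration in milliseconds (as printed), and that the tensors are float32; `overlap_ms` is just an illustrative helper name.

```python
# Values copied from the two nvprof rows above.
kernel_start_s, kernel_ms = 12.9429, 4.6388  # add kernel (stream 18)
copy_start_s, copy_ms = 12.9468, 74.185      # [CUDA memcpy PtoP] (stream 28)


def overlap_ms(start_a_s, dur_a_ms, start_b_s, dur_b_ms):
    """Return how long (in ms) two intervals overlap, or 0.0 if they are disjoint."""
    a0, b0 = start_a_s * 1e3, start_b_s * 1e3
    a1, b1 = a0 + dur_a_ms, b0 + dur_b_ms
    return max(0.0, min(a1, b1) - max(a0, b0))


n = int(3e4)
size_gib = n * n * 4 / 2**30              # float32 takes 4 bytes per element
throughput = size_gib / (copy_ms / 1e3)   # GiB per second during the copy

shared = overlap_ms(kernel_start_s, kernel_ms, copy_start_s, copy_ms)
print(f"kernel/copy overlap: {shared:.4f} ms")
print(f"tensor size: {size_gib:.4f} GiB, copy throughput: {throughput:.2f} GiB/s")
```

So the add kernel is still running for a bit under a millisecond after the P2P copy has started, and the size and throughput agree with what nvprof reports for the copy.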


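import torch

n = int(3e4)

# Source tensor on GPU 0
d0 = torch.device('cuda:0')
x0 = torch.zeros(n, n, device=d0)

# Queue some work on GPU 1 (the results are discarded)
d1 = torch.device('cuda:1')
x1 = torch.zeros(n, n, device=d1)
x1 + 1
x1 + 1

# Copy x0 to GPU 1 on a custom (non-default) stream
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    x2 = x0.to(d1)

torch.cuda.synchronize()  # syncing on dev1 doesn't help, and using events doesn't help either

print(x2[0, 0])  # prints tensor(1., device='cuda:1'); expected 0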

nvprof output:

Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Src Dev Src Ctx Dst Dev Dst Ctx Name
6.69668s 2.0160us - - - - - 3.3528GB 2e+06GB/s Device - Tesla V100-SXM2 1 7 - - - - [CUDA memset]
13.0215s 1.9520us - - - - - 3.3528GB 2e+06GB/s Device - Tesla V100-SXM2 2 18 - - - - [CUDA memset]
13.0261s 4.7312ms (878907 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi512ELi1EZNS0_16gpu_unary_kernelIZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS6_RKT_EUlfE0_EEvS6_SC_EUliE_EEviT1 [886]
13.0292s 74.185ms - - - - - 3.3528GB 45.194GB/s Device Device Tesla V100-SXM2 1 28 Tesla V100-SXM2 1 Tesla V100-SXM2 2 [CUDA memcpy PtoP]
13.0308s 5.0902ms (878907 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi512ELi1EZNS0_16gpu_unary_kernelIZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS6_RKT_EUlfE0_EEvS6_SC_EUliE_EEviT1 [889]
13.0359s 5.0761ms (878907 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi512ELi1EZNS0_16gpu_unary_kernelIZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS6_RKT_EUlfE0_EEvS6_SC_EUliE_EEviT1 [902]
13.0410s 5.0793ms (878907 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi512ELi1EZNS0_16gpu_unary_kernelIZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS6_RKT_EUlfE0_EEvS6_SC_EUliE_EEviT1 [905]
13.1045s 2.6560us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply3<TensorEQOp<float, unsigned char>, unsigned char, float, float, unsigned int, int=1, int=1, int=1>(OffsetInfo<unsigned char, float, float>, OffsetInfo<TensorEQOp<float, unsigned char>, float, unsigned int>, OffsetInfo<unsigned char, float, int=1>, float, float) [1014]
13.1046s 1.5040us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<Tensor_abs_Float_Op, float, float, unsigned int, int=1, int=1>(OffsetInfo<float, unsigned int, int=1>, OffsetInfo<float, unsigned int, int=1>, unsigned int, Tensor_abs_Float_Op) [1028]
13.1047s 1.7280us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<TensorNEValueOp<float, unsigned char>, unsigned char, float, unsigned int, int=1, int=1>(OffsetInfo<unsigned char, unsigned char, float>, OffsetInfo<TensorNEValueOp<float, unsigned char>, unsigned char, unsigned int>, unsigned char, float) [1040]
13.1048s 1.6630us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply3<TensorBitAndOp, unsigned char, unsigned char, unsigned char, unsigned int, int=1, int=1, int=1>(OffsetInfo<TensorBitAndOp, unsigned char, unsigned int>, OffsetInfo<unsigned char, unsigned char, int=1>, OffsetInfo<unsigned char, unsigned char, int=1>, unsigned char, unsigned char) [1051]
13.1048s 1.1840us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<TensorNEValueOp<float, unsigned char>, unsigned char, float, unsigned int, int=1, int=1>(OffsetInfo<unsigned char, unsigned char, float>, OffsetInfo<TensorNEValueOp<float, unsigned char>, unsigned char, unsigned int>, unsigned char, float) [1062]
13.1048s 1.2160us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply3<TensorBitAndOp, unsigned char, unsigned char, unsigned char, unsigned int, int=1, int=1, int=1>(OffsetInfo<TensorBitAndOp, unsigned char, unsigned int>, OffsetInfo<unsigned char, unsigned char, int=1>, OffsetInfo<unsigned char, unsigned char, int=1>, unsigned char, unsigned char) [1073]
13.1049s 3.6480us (1 1 1) (1024 1 1) 30 0B 8.0000KB - - - - Tesla V100-SXM2 2 18 - - - - void kernelReduceAll<unsigned char, unsigned int, long, thrust::identity, ReduceAdd, int=1>(TensorInfo<unsigned char, unsigned int>, unsigned int, long, long, thrust::identity, long*) [1086]
13.1049s 3.2000us - - - - - 8B 2.3842MB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1051s 1.9520us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at4cuda75_GLOBAL__N__51_tmpxft_000055a3_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a321kernelPointwiseApply2IZN75_GLOBAL__N__51_tmpxft_000055a3_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a36CopyOpIlhE5applyERNS_6TensorERKS6_EUlRlRKhE_lhjLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSF_IT1_SH_EESH_T [1101]
13.1051s 1.3440us (1 1 1) (128 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__scan::InitAgent<thrust::cuda_cub::cub::ScanTileState<long, bool=1>, int>, thrust::cuda_cub::cub::ScanTileState<long, bool=1>, int>(bool=1, thrust::cuda_cub::cub::ScanTileState<long, bool=1>) [1121]
13.1051s 7.0080us (1 1 1) (96 1 1) 42 0B 2.0781KB - - - - Tesla V100-SXM2 2 18 - - - - void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__scan::ScanAgent<thrust::device_ptr, thrust::device_ptr, thrust::plus, int, long, thrust::detail::integral_constant<bool, bool=0>>, thrust::device_ptr, thrust::device_ptr, thrust::plus, int, thrust::cuda_cub::cub::ScanTileState<long, bool=1>, thrust::cuda_cub::__scan::AddInitToExclusiveScan<long, thrust::plus>>(thrust::device_ptr, thrust::device_ptr, long, thrust::plus, int, long) [1126]
13.1052s 1.7600us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply3<TensorMaskedSelectOp<float, unsigned char, long>, unsigned char, long, float, unsigned int, int=1, int=1, int=1>(OffsetInfo<unsigned char, unsigned char, long>, OffsetInfo<long, unsigned char, float>, OffsetInfo<TensorMaskedSelectOp<float, unsigned char, long>, unsigned char, unsigned int>, unsigned char, float) [1132]
13.1052s 1.2480us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<Tensor_abs_Float_Op, float, float, unsigned int, int=1, int=1>(OffsetInfo<float, unsigned int, int=1>, OffsetInfo<float, unsigned int, int=1>, unsigned int, Tensor_abs_Float_Op) [1146]
13.1053s 1.8240us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at4cuda75_GLOBAL__N__51_tmpxft_000055a3_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a321kernelPointwiseApply2IZN75_GLOBAL__N__51_tmpxft_000055a3_00000000_11_Copy_compute_75_cpp1_ii_dd3fb9a36CopyOpIdfE5applyERNS_6TensorERKS6_EUlRdRKfE_dfjLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSF_IT1_SH_EESH_T [1159]
13.1054s 5.6320us (1 1 1) (1024 1 1) 16 0B 8.0000KB - - - - Tesla V100-SXM2 2 18 - - - - void kernelReduceAll<double, unsigned int, double, thrust::identity, ReduceMin, int=1>(TensorInfo<double, unsigned int>, unsigned int, double, double, thrust::identity, double*) [1172]
13.1054s 1.8880us - - - - - 8B 4.0410MB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1055s 1.3440us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply1<TensorFillOp, double, unsigned int, int=1>(OffsetInfo<TensorFillOp, double, unsigned int>, double, double) [1184]
13.1055s 5.4080us (1 1 1) (1024 1 1) 16 0B 8.0000KB - - - - Tesla V100-SXM2 2 18 - - - - void kernelReduceAll<double, unsigned int, double, thrust::identity, ReduceMax, int=1>(TensorInfo<double, unsigned int>, unsigned int, double, double, thrust::identity, double*) [1196]
13.1055s 1.5040us - - - - - 8B 5.0727MB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1056s 1.1840us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply1<TensorFillOp, double, unsigned int, int=1>(OffsetInfo<TensorFillOp, double, unsigned int>, double, double) [1208]
13.1057s 1.5680us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<Tensor_ceil_Float_Op, float, float, unsigned int, int=1, int=1>(OffsetInfo<float, unsigned int, int=1>, OffsetInfo<float, unsigned int, int=1>, unsigned int, Tensor_ceil_Float_Op) [1226]
13.1057s 1.5680us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply3<TensorNEOp<float, unsigned char>, unsigned char, float, float, unsigned int, int=1, int=1, int=1>(OffsetInfo<unsigned char, float, float>, OffsetInfo<TensorNEOp<float, unsigned char>, float, unsigned int>, OffsetInfo<unsigned char, float, int=1>, float, float) [1238]
13.1058s 1.6000us - - - - - 1B 610.35KB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1059s 2.3680us (1 1 1) (128 1 1) 26 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - ZN2at6native18elementwise_kernelILi128ELi4EZNS0_17gpu_binary_kernelIZNS0_15div_kernel_implIdEEvRNS_14TensorIteratorEEUlddE_EEvS5_RKT_EUliE0_EEviT1 [1259]
13.1060s 1.5360us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<TensorGTValueOp<double, unsigned char>, unsigned char, double, unsigned int, int=1, int=1>(OffsetInfo<unsigned char, unsigned char, double>, OffsetInfo<TensorGTValueOp<double, unsigned char>, unsigned char, unsigned int>, unsigned char, double) [1270]
13.1060s 1.4720us - - - - - 1B 663.42KB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1060s 1.2480us (1 1 1) (512 1 1) 16 0B 0B - - - - Tesla V100-SXM2 2 18 - - - - void kernelPointwiseApply2<TensorGTValueOp<double, unsigned char>, unsigned char, double, unsigned int, int=1, int=1>(OffsetInfo<unsigned char, unsigned char, double>, OffsetInfo<TensorGTValueOp<double, unsigned char>, unsigned char, unsigned int>, unsigned char, double) [1289]
13.1060s 1.4720us - - - - - 1B 663.42KB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1061s 1.4720us - - - - - 4B 2.5915MB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]
13.1061s 1.9520us - - - - - 4B 1.9543MB/s Device Pageable Tesla V100-SXM2 2 18 - - - - [CUDA memcpy DtoH]

I've answered in the issue: https://github.com/pytorch/pytorch/issues/15568
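
In case it helps others landing here, below is a minimal sketch of one thing to try. It rests on an assumption, not a confirmed diagnosis: that the stale 1 comes from x2 reusing memory that a still-running "x1 + 1" kernel on GPU 1 writes to, because the copy on the custom stream is not ordered after that kernel. `copy_on_side_stream` is a hypothetical helper name, and the sketch assumes a recent PyTorch where `wait_stream` works across devices.

```python
import torch


def copy_on_side_stream(src, dst_device):
    """Copy `src` to `dst_device` on a side stream, ordered after pending work.

    Assumption: wrong values come from the copy racing with kernels already
    queued on the destination device's current stream.
    """
    dst_device = torch.device(dst_device)
    dst_current = torch.cuda.current_stream(dst_device)
    side = torch.cuda.Stream(device=dst_device)

    # The side stream waits for work already queued on both devices.
    side.wait_stream(dst_current)
    side.wait_stream(torch.cuda.current_stream(src.device))

    with torch.cuda.stream(side):
        out = src.to(dst_device, non_blocking=True)

    # Later work on the destination's current stream waits for the copy.
    dst_current.wait_stream(side)
    # `out` was allocated on `side` but will be used on `dst_current`.
    out.record_stream(dst_current)
    return out


if __name__ == "__main__":
    if torch.cuda.device_count() >= 2:
        n = 4096  # smaller than the original repro to keep memory use modest
        x0 = torch.zeros(n, n, device="cuda:0")
        x1 = torch.zeros(n, n, device="cuda:1")
        x1 + 1
        x2 = copy_on_side_stream(x0, "cuda:1")
        print(x2[0, 0])
    else:
        print("this sketch needs at least two CUDA devices")
```

To copy to several GPUs, the same helper could be called once per destination. Whether those copies actually overlap depends on which stream PyTorch issues the P2P copy on, which I have not verified for every version.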