AWS P4 instance: Unable to run single-node multi-GPU training with PyTorch 1.5.0 + CUDA 10.1

Hi PyTorch Team,

I’m trying to use AWS p4 instances to train a Neural Machine Translation (NMT) model with fairseq. I am training on a single node, so this is not multi-node distributed training. I am able to train the NMT model on a single GPU, but when I try to use multiple GPUs on a single instance, training fails with the following error log.

 | distributed init (rank 1): tcp://localhost:13852
 | distributed init (rank 2): tcp://localhost:13852
 | distributed init (rank 5): tcp://localhost:13852
 | distributed init (rank 3): tcp://localhost:13852
 | distributed init (rank 6): tcp://localhost:13852
 | distributed init (rank 4): tcp://localhost:13852
 | distributed init (rank 0): tcp://localhost:13852
 | distributed init (rank 7): tcp://localhost:13852
 | initialized host ip-10-7-6-34 as rank 7
 | initialized host ip-10-7-6-34 as rank 1
 | initialized host ip-10-7-6-34 as rank 2
 | initialized host ip-10-7-6-34 as rank 5
 | initialized host ip-10-7-6-34 as rank 3
 | initialized host ip-10-7-6-34 as rank 6
 | initialized host ip-10-7-6-34 as rank 4
 | initialized host ip-10-7-6-34 as rank 0
 ip-10-7-6-34:26807:26807 [0] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26807:26807 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 ip-10-7-6-34:26807:26807 [0] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
 
 ip-10-7-6-34:26807:26807 [0] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
 ip-10-7-6-34:26807:26807 [0] NCCL INFO NET/IB : No device found.
 ip-10-7-6-34:26807:26807 [0] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
 NCCL version 2.4.8+cuda10.1
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
 ip-10-7-6-34:26814:26814 [7] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26814:26814 [7] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 ip-10-7-6-34:26814:26814 [7] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
 ip-10-7-6-34:26814:26814 [7] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
 ip-10-7-6-34:26814:26814 [7] NCCL INFO NET/IB : No device found.
 ip-10-7-6-34:26814:26814 [7] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
 ip-10-7-6-34:26810:26810 [3] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26810:26810 [3] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 ip-10-7-6-34:26810:26810 [3] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
 
 ip-10-7-6-34:26810:26810 [3] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
 ip-10-7-6-34:26810:26810 [3] NCCL INFO NET/IB : No device found.
 ip-10-7-6-34:26810:26810 [3] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
 ip-10-7-6-34:26811:26811 [4] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26811:26811 [4] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 ip-10-7-6-34:26811:26811 [4] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
 
 ip-10-7-6-34:26811:26811 [4] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
 ip-10-7-6-34:26811:26811 [4] NCCL INFO NET/IB : No device found.
 ip-10-7-6-34:26811:26811 [4] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
 ip-10-7-6-34:26813:26813 [6] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26813:26813 [6] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 ip-10-7-6-34:26813:26813 [6] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
 
 ip-10-7-6-34:26813:26813 [6] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
 ip-10-7-6-34:26813:26813 [6] NCCL INFO NET/IB : No device found.
 ip-10-7-6-34:26813:26813 [6] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
 ip-10-7-6-34:26809:26809 [2] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26809:26809 [2] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 ip-10-7-6-34:26809:26809 [2] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
 
 ip-10-7-6-34:26809:26809 [2] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
 ip-10-7-6-34:26809:26809 [2] NCCL INFO NET/IB : No device found.
 ip-10-7-6-34:26809:26809 [2] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
 ip-10-7-6-34:26808:26808 [1] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26808:26808 [1] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 ip-10-7-6-34:26808:26808 [1] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
 
 ip-10-7-6-34:26808:26808 [1] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
 ip-10-7-6-34:26808:26808 [1] NCCL INFO NET/IB : No device found.
 ip-10-7-6-34:26808:26808 [1] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
 ip-10-7-6-34:26812:26812 [5] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26812:26812 [5] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
 ip-10-7-6-34:26812:26812 [5] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
 
 ip-10-7-6-34:26812:26812 [5] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
 ip-10-7-6-34:26812:26812 [5] NCCL INFO NET/IB : No device found.
 ip-10-7-6-34:26812:26812 [5] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 04 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 05 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 06 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 07 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 08 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 09 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 10 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 11 :    0   1   2   3   4   5   6   7
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 00 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 01 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 01 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 01 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 02 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 02 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 02 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 03 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 03 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 03 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 04 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 04 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 04 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 04 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 04 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 04 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 04 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 04 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 05 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 05 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 05 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 05 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 05 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 05 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 05 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 05 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 06 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 06 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 06 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 06 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 06 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 06 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 06 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 06 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 07 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 07 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 07 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 07 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 07 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 07 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 07 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 07 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 08 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 08 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 08 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 08 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 08 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 08 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 08 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 08 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 09 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 09 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 09 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 09 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 09 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 09 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 09 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 09 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 10 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 10 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 10 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 10 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 10 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 10 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 10 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 10 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 11 : 1[1] -> 2[2] via P2P/IPC
 ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 11 : 3[3] -> 4[4] via P2P/IPC
 ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 11 : 2[2] -> 3[3] via P2P/IPC
 ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 11 : 5[5] -> 6[6] via P2P/IPC
 ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 11 : 4[4] -> 5[5] via P2P/IPC
 ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 11 : 6[6] -> 7[7] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 11 : 0[0] -> 1[1] via P2P/IPC
 ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 11 : 7[7] -> 0[0] via P2P/IPC
 ip-10-7-6-34:26807:27073 [0] NCCL INFO Using 256 threads, Min Comp Cap 8, Trees disabled
 ip-10-7-6-34:26808:27144 [1] NCCL INFO comm 0x7fa7e40028a0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
 
 ip-10-7-6-34:26808:26808 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
 ip-10-7-6-34:26808:26808 [1] NCCL INFO misc/group.cc:148 -> 1
 ip-10-7-6-34:26812:27154 [5] NCCL INFO comm 0x7f21f40028a0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 - Init COMPLETE
 ip-10-7-6-34:26813:27129 [6] NCCL INFO comm 0x7f74c40028a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 - Init COMPLETE
 ip-10-7-6-34:26813:26813 [6] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
 ip-10-7-6-34:26813:26813 [6] NCCL INFO misc/group.cc:148 -> 1
 ip-10-7-6-34:26810:27104 [3] NCCL INFO comm 0x7f2d640028a0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
 
 ip-10-7-6-34:26812:26812 [5] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
 ip-10-7-6-34:26812:26812 [5] NCCL INFO misc/group.cc:148 -> 1
 
 ip-10-7-6-34:26810:26810 [3] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
 ip-10-7-6-34:26810:26810 [3] NCCL INFO misc/group.cc:148 -> 1
 ip-10-7-6-34:26809:27139 [2] NCCL INFO comm 0x7f2e380028a0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
 ip-10-7-6-34:26811:27118 [4] NCCL INFO comm 0x7fc5f00028a0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 - Init COMPLETE
 
 ip-10-7-6-34:26809:26809 [2] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
 ip-10-7-6-34:26809:26809 [2] NCCL INFO misc/group.cc:148 -> 1
 ip-10-7-6-34:26814:27100 [7] NCCL INFO comm 0x7f09080028a0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 - Init COMPLETE
 ip-10-7-6-34:26807:27073 [0] NCCL INFO comm 0x7f50e40028a0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
 
 ip-10-7-6-34:26811:26811 [4] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
 ip-10-7-6-34:26811:26811 [4] NCCL INFO misc/group.cc:148 -> 1
 ip-10-7-6-34:26807:26807 [0] NCCL INFO Launch mode Parallel
 
 ip-10-7-6-34:26807:26807 [0] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
 ip-10-7-6-34:26807:26807 [0] NCCL INFO misc/group.cc:148 -> 1
 
 ip-10-7-6-34:26814:26814 [7] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
 ip-10-7-6-34:26814:26814 [7] NCCL INFO misc/group.cc:148 -> 1
 Traceback (most recent call last):
   File "/home/ubuntu/train.py", line 315, in <module>
     cli_main()
   File "/home/ubuntu/train.py", line 307, in cli_main
     nprocs=args.distributed_world_size,
   File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
     return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
   File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in tart_processes
     while not context.join():
   File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
     raise Exception(msg)
 Exception:
 
 -- Process 5 terminated with the following error:
 Traceback (most recent call last):
   File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
     fn(i, *args)
   File "/home/ubuntu/train.py", line 274, in distributed_main
   File "/home/ubuntu/train.py", line 36, in main
     args.distributed_rank = distributed_utils.distributed_init(args)
   File "/home/ubuntu/fairseq/distributed_utils.py", line 85, in distributed_init
     dist.all_reduce(torch.rand(1).cuda())
   File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
     work = _default_pg.allreduce([tensor], opts)
 RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1587428091666/work/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8
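
To take fairseq out of the picture, the failure can also be reproduced with a minimal all_reduce script. This is a sketch of what distributed_utils.distributed_init effectively does (repro.py is a hypothetical name, and the port is arbitrary):

    # repro.py -- hypothetical minimal single-node multi-GPU NCCL check
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "13852"
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        # Mirrors the all_reduce in fairseq/distributed_utils.py where the
        # "unhandled cuda error" is raised.
        dist.all_reduce(torch.rand(1).cuda())
        print(f"rank {rank}: all_reduce OK")

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()  # 8 on p4d.24xlarge
        mp.spawn(worker, args=(world_size,), nprocs=world_size)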


Environment:

  • AWS Deep Learning AMI with Ubuntu 18.04
  • CUDA: 10.1
  • PyTorch: 1.5.0 with CUDA 10.1
  • fairseq: 0.9
  • NCCL version: 2.4.8
  • GPU type: NVIDIA A100
  • Number of GPUs: 8
  • Number of nodes: 1

I have referred to the following GitHub issues and tried most of the suggestions given there, but I have not been able to resolve the issue.

The “unhandled cuda error” is a fairly generic message; the actual cause is more likely indicated by the “Cuda failure ‘invalid device function’” warning.

These errors are usually caused by incorrect configuration, and you may try setting the following environment variables (see the sketch after the list):
* NCCL_TREE_THRESHOLD=0
* NCCL_SOCKET_IFNAME=
* NCCL_IB_DISABLE=1
* NCCL_P2P_DISABLE=1
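
A minimal sketch of setting these inside the training script itself; they must take effect before the NCCL process group is created, and the interface name for NCCL_SOCKET_IFNAME is taken from the ens32 shown in your log:

    # Hypothetical snippet: set NCCL knobs before torch.distributed initializes.
    import os

    os.environ["NCCL_TREE_THRESHOLD"] = "0"    # force the ring algorithm
    os.environ["NCCL_IB_DISABLE"] = "1"        # disable the InfiniBand transport
    os.environ["NCCL_P2P_DISABLE"] = "1"       # disable GPU peer-to-peer copies
    os.environ["NCCL_SOCKET_IFNAME"] = "ens32" # interface name from your log

    import torch.distributed as dist  # import/init only after the env is set

Equivalently, the variables can be exported in the shell before launching training.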

A similar error in Horovod (Cuda failure 'invalid device function' · Issue #1171 · horovod/horovod · GitHub) was fixed by setting NCCL_IB_DISABLE=1.

This error might be raised because you are using a node with A100s (sm_80) together with an old PyTorch 1.5.0 binary built against the CUDA 10.1 runtime, which does not support this GPU architecture.
While the CUDA JIT might kick in to compile native PyTorch kernels, other libraries might still raise the posted error, so you would have to update PyTorch to the latest release built with CUDA 11.0.
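
As a quick check, a small sketch (these calls exist in PyTorch 1.5.0) makes the mismatch visible:

    # Sketch: compare the binary's CUDA runtime with the GPU's compute capability.
    import torch

    print(torch.__version__)                    # 1.5.0
    print(torch.version.cuda)                   # '10.1' -> runtime the binary was built against
    print(torch.cuda.get_device_name(0))        # an A100 variant on p4d
    print(torch.cuda.get_device_capability(0))  # (8, 0), i.e. sm_80
    # CUDA 10.1 can only build device code up to sm_75 (Turing), so the binary
    # ships no sm_80 kernels -- hence "invalid device function".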

Thanks @osalpekar and @ptrblck for your suggestions. I will try upgrading to a PyTorch build with CUDA 11.0 and see whether that resolves the issue.