Hi PyTorch Team,
I’m trying to train a Neural Machine Translation (NMT) model with fairseq on an AWS p4d instance. This is single-node training, not multi-node distributed training. Training works fine on a single GPU, but as soon as I use multiple GPUs on the same instance, it fails with the following error log:
| distributed init (rank 1): tcp://localhost:13852
| distributed init (rank 2): tcp://localhost:13852
| distributed init (rank 5): tcp://localhost:13852
| distributed init (rank 3): tcp://localhost:13852
| distributed init (rank 6): tcp://localhost:13852
| distributed init (rank 4): tcp://localhost:13852
| distributed init (rank 0): tcp://localhost:13852
| distributed init (rank 7): tcp://localhost:13852
| initialized host ip-10-7-6-34 as rank 7
| initialized host ip-10-7-6-34 as rank 1
| initialized host ip-10-7-6-34 as rank 2
| initialized host ip-10-7-6-34 as rank 5
| initialized host ip-10-7-6-34 as rank 3
| initialized host ip-10-7-6-34 as rank 6
| initialized host ip-10-7-6-34 as rank 4
| initialized host ip-10-7-6-34 as rank 0
ip-10-7-6-34:26807:26807 [0] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26807:26807 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-7-6-34:26807:26807 [0] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
ip-10-7-6-34:26807:26807 [0] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-7-6-34:26807:26807 [0] NCCL INFO NET/IB : No device found.
ip-10-7-6-34:26807:26807 [0] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
NCCL version 2.4.8+cuda10.1
ip-10-7-6-34:26807:27073 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
ip-10-7-6-34:26814:26814 [7] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26814:26814 [7] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-7-6-34:26814:26814 [7] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
ip-10-7-6-34:26814:26814 [7] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-7-6-34:26814:26814 [7] NCCL INFO NET/IB : No device found.
ip-10-7-6-34:26814:26814 [7] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26814:27100 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
ip-10-7-6-34:26810:26810 [3] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26810:26810 [3] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-7-6-34:26810:26810 [3] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
ip-10-7-6-34:26810:26810 [3] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-7-6-34:26810:26810 [3] NCCL INFO NET/IB : No device found.
ip-10-7-6-34:26810:26810 [3] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26810:27104 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
ip-10-7-6-34:26811:26811 [4] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26811:26811 [4] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-7-6-34:26811:26811 [4] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
ip-10-7-6-34:26811:26811 [4] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-7-6-34:26811:26811 [4] NCCL INFO NET/IB : No device found.
ip-10-7-6-34:26811:26811 [4] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26811:27118 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
ip-10-7-6-34:26813:26813 [6] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26813:26813 [6] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-7-6-34:26813:26813 [6] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
ip-10-7-6-34:26813:26813 [6] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-7-6-34:26813:26813 [6] NCCL INFO NET/IB : No device found.
ip-10-7-6-34:26813:26813 [6] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26813:27129 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
ip-10-7-6-34:26809:26809 [2] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26809:26809 [2] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-7-6-34:26809:26809 [2] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
ip-10-7-6-34:26809:26809 [2] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-7-6-34:26809:26809 [2] NCCL INFO NET/IB : No device found.
ip-10-7-6-34:26809:26809 [2] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26809:27139 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
ip-10-7-6-34:26808:26808 [1] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26808:26808 [1] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-7-6-34:26808:26808 [1] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
ip-10-7-6-34:26808:26808 [1] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-7-6-34:26808:26808 [1] NCCL INFO NET/IB : No device found.
ip-10-7-6-34:26808:26808 [1] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26808:27144 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
ip-10-7-6-34:26812:26812 [5] NCCL INFO Bootstrap : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26812:26812 [5] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-10.1/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-7-6-34:26812:26812 [5] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
ip-10-7-6-34:26812:26812 [5] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-7-6-34:26812:26812 [5] NCCL INFO NET/IB : No device found.
ip-10-7-6-34:26812:26812 [5] NCCL INFO NET/Socket : Using [0]ens32:10.7.6.34<0>
ip-10-7-6-34:26812:27154 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 01 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 02 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 03 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 04 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 05 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 06 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 07 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 08 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 09 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 10 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26807:27073 [0] NCCL INFO Channel 11 : 0 1 2 3 4 5 6 7
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 00 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 01 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 01 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 01 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 02 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 02 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 02 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 03 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 03 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 03 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 04 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 04 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 04 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 04 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 04 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 04 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 04 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 04 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 05 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 05 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 05 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 05 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 05 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 05 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 05 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 05 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 06 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 06 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 06 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 06 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 06 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 06 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 06 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 06 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 07 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 07 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 07 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 07 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 07 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 07 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 07 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 07 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 08 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 08 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 08 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 08 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 08 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 08 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 08 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 08 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 09 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 09 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 09 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 09 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 09 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 09 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 09 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 09 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 10 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 10 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 10 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 10 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 10 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 10 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 10 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 10 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26808:27144 [1] NCCL INFO Ring 11 : 1[1] -> 2[2] via P2P/IPC
ip-10-7-6-34:26810:27104 [3] NCCL INFO Ring 11 : 3[3] -> 4[4] via P2P/IPC
ip-10-7-6-34:26809:27139 [2] NCCL INFO Ring 11 : 2[2] -> 3[3] via P2P/IPC
ip-10-7-6-34:26812:27154 [5] NCCL INFO Ring 11 : 5[5] -> 6[6] via P2P/IPC
ip-10-7-6-34:26811:27118 [4] NCCL INFO Ring 11 : 4[4] -> 5[5] via P2P/IPC
ip-10-7-6-34:26813:27129 [6] NCCL INFO Ring 11 : 6[6] -> 7[7] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Ring 11 : 0[0] -> 1[1] via P2P/IPC
ip-10-7-6-34:26814:27100 [7] NCCL INFO Ring 11 : 7[7] -> 0[0] via P2P/IPC
ip-10-7-6-34:26807:27073 [0] NCCL INFO Using 256 threads, Min Comp Cap 8, Trees disabled
ip-10-7-6-34:26808:27144 [1] NCCL INFO comm 0x7fa7e40028a0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
ip-10-7-6-34:26808:26808 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
ip-10-7-6-34:26808:26808 [1] NCCL INFO misc/group.cc:148 -> 1
ip-10-7-6-34:26812:27154 [5] NCCL INFO comm 0x7f21f40028a0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 - Init COMPLETE
ip-10-7-6-34:26813:27129 [6] NCCL INFO comm 0x7f74c40028a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 - Init COMPLETE
ip-10-7-6-34:26813:26813 [6] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
ip-10-7-6-34:26813:26813 [6] NCCL INFO misc/group.cc:148 -> 1
ip-10-7-6-34:26810:27104 [3] NCCL INFO comm 0x7f2d640028a0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
ip-10-7-6-34:26812:26812 [5] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
ip-10-7-6-34:26812:26812 [5] NCCL INFO misc/group.cc:148 -> 1
ip-10-7-6-34:26810:26810 [3] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
ip-10-7-6-34:26810:26810 [3] NCCL INFO misc/group.cc:148 -> 1
ip-10-7-6-34:26809:27139 [2] NCCL INFO comm 0x7f2e380028a0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
ip-10-7-6-34:26811:27118 [4] NCCL INFO comm 0x7fc5f00028a0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 - Init COMPLETE
ip-10-7-6-34:26809:26809 [2] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
ip-10-7-6-34:26809:26809 [2] NCCL INFO misc/group.cc:148 -> 1
ip-10-7-6-34:26814:27100 [7] NCCL INFO comm 0x7f09080028a0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 - Init COMPLETE
ip-10-7-6-34:26807:27073 [0] NCCL INFO comm 0x7f50e40028a0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
ip-10-7-6-34:26811:26811 [4] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
ip-10-7-6-34:26811:26811 [4] NCCL INFO misc/group.cc:148 -> 1
ip-10-7-6-34:26807:26807 [0] NCCL INFO Launch mode Parallel
ip-10-7-6-34:26807:26807 [0] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
ip-10-7-6-34:26807:26807 [0] NCCL INFO misc/group.cc:148 -> 1
ip-10-7-6-34:26814:26814 [7] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
ip-10-7-6-34:26814:26814 [7] NCCL INFO misc/group.cc:148 -> 1
Traceback (most recent call last):
  File "/home/ubuntu/train.py", line 315, in <module>
    cli_main()
  File "/home/ubuntu/train.py", line 307, in cli_main
    nprocs=args.distributed_world_size,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/ubuntu/train.py", line 274, in distributed_main
  File "/home/ubuntu/train.py", line 36, in main
    args.distributed_rank = distributed_utils.distributed_init(args)
  File "/home/ubuntu/fairseq/distributed_utils.py", line 85, in distributed_init
    dist.all_reduce(torch.rand(1).cuda())
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1587428091666/work/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8
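Since every rank prints nearly identical messages, I used the small stdlib-only script below to pull the distinct NCCL warnings out of the log above (the `host:pid:tid [gpu]` prefix is stripped before de-duplicating). They reduce to just the EFA provider notice and the `Cuda failure 'invalid device function'` on every rank:

```python
import re

# The per-rank prefix looks like "ip-10-7-6-34:26807:26807 [0] ".
PREFIX = re.compile(r"^\S+:\d+:\d+ \[\d+\] ")

def distinct_warnings(log_text):
    """Return unique NCCL WARN messages, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        line = PREFIX.sub("", line.strip())
        if "NCCL WARN" in line and line not in seen:
            seen.append(line)
    return seen

# A few sample lines from the log above:
sample = """\
ip-10-7-6-34:26807:26807 [0] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-7-6-34:26808:26808 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
ip-10-7-6-34:26813:26813 [6] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
"""
for warning in distinct_warnings(sample):
    print(warning)
```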
Environment:
- AWS Deep Learning AMI (Ubuntu 18.04)
- CUDA: 10.1
- PyTorch: 1.5.0 (CUDA 10.1 build)
- fairseq: 0.9
- NCCL: 2.4.8
- GPU type: NVIDIA A100
- Number of GPUs: 8
- Number of nodes: 1
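For context, the launch command looks roughly like this (the dataset path, architecture, and hyperparameters below are placeholders, not my exact flags):

```shell
# Single node, 8 GPUs; fairseq spawns one process per GPU
# via torch.multiprocessing.spawn. Flags are illustrative.
fairseq-train data-bin/my-corpus \
    --arch transformer \
    --optimizer adam \
    --max-tokens 4096 \
    --ddp-backend c10d \
    --distributed-world-size 8
```

With `--distributed-world-size 1` the same setup trains without errors; the failure only appears once multiple GPUs are used.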
I have gone through the following GitHub issues and forum threads and tried most of the suggestions there, but I still haven’t been able to resolve the issue:
- BART-Large: RuntimeError: CUDA error: the launch timed out and was terminated · Issue #2311 · pytorch/fairseq · GitHub
- Pytorch 1.5.0 (installed from conda) errors with complaints about incompatibility between MKL and libgomp when using Pytorch's multiprocessing · Issue #37377 · pytorch/pytorch · GitHub
- NCCL backend fails when calling broadcast from different threads · Issue #18300 · pytorch/pytorch · GitHub
- unhandled cuda error while training using multiple nodes · Issue #973 · pytorch/fairseq · GitHub
- Crash when initializing distributed training across 2 machines - #4 by naykun