The connection to the C10d store has failed

I am trying to run my code on two servers, each with one GPU. The code itself is very simple (I already tested it on a single machine with 2 GPUs and it works fine); I added some code for the global rank and local rank so that it runs in multi-node form (see the sketch after the commands below). But I am getting the error below.
I use these commands to run my code; the first one is for the master node:

  • $ torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=10.30.7.22:29603 ddp-cifar100-multinode.py --epochs 10 --batch-size 16

  • $ torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=10.30.7.22:29603 ddp-cifar100-multinode.py --epochs 10 --batch-size 16
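
For context, here is a minimal sketch of the rank handling a torchrun multi-node launch expects. The environment variable names (LOCAL_RANK, RANK, WORLD_SIZE) are really set by torchrun; how they are consumed here is an illustrative assumption, not the exact code from ddp-cifar100-multinode.py:

  import os
  import torch
  import torch.distributed as dist

  # torchrun exports these variables for every worker it launches
  local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node
  global_rank = int(os.environ["RANK"])        # unique rank across all nodes
  world_size = int(os.environ["WORLD_SIZE"])   # total number of workers

  dist.init_process_group(backend="nccl")      # rendezvous happens here
  torch.cuda.set_device(local_rank)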

ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{
"message": {
"message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
"extraInfo": {
"py_callstack": "Traceback (most recent call last):\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 156, in _create_tcp_store\n host, port, is_master=is_server, timeout=timedelta(seconds=read_timeout)\nRuntimeError: connect() timed out. Original timeout was 60000 ms.\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n return f(*args, **kwargs)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/run.py\", line 719, in main\n run(args)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/run.py\", line 713, in run\n )(*cmd_args)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/launcher/api.py\", line 131, in __call__\n return launch_agent(self._config, self._entrypoint, list(args))\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/launcher/api.py\", line 228, in launch_agent\n rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 64, in get_rendezvous_handler\n return handler_registry.create_handler(params)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/api.py\", line 253, in create_handler\n handler = creator(params)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 35, in _create_c10d_handler\n backend, store = create_backend(params)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 250, in create_backend\n store = _create_tcp_store(params)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 177, in _create_tcp_store\n ) from exc\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.\n",
"timestamp": "1715699987"
}
}
}
Traceback (most recent call last):
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 156, in _create_tcp_store
host, port, is_master=is_server, timeout=timedelta(seconds=read_timeout)
RuntimeError: connect() timed out. Original timeout was 60000 ms.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/omid/omid/omid_env/bin/torchrun", line 11, in <module>
sys.exit(main())
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 228, in launch_agent
rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 64, in get_rendezvous_handler
return handler_registry.create_handler(params)
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/api.py", line 253, in create_handler
handler = creator(params)
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 35, in _create_c10d_handler
backend, store = create_backend(params)
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 250, in create_backend
store = _create_tcp_store(params)
File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 177, in _create_tcp_store
) from exc
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.

Hi omid,
I had the same problem as you, and I don't know how to solve it.
I just followed the tutorial from the PyTorch forum, but it just freezes.
If you find a solution somewhere else, please let me know.

I am still facing this problem, and the solution is strange: I have three different servers and have fixed only one of them. I downgraded everything (Python 3.8.10, CUDA 10.1, PyTorch 1.9) and it worked, but I did the same on the other two and still get the same error! I am trying to solve it ASAP and will let you know if I find anything useful. I also found out that my code is fine and the problem is the configuration, not the code, so make sure your code is correct, or pick something that is already tested.

Yeah, you are right, it's the configuration of the machines.
I asked our cluster engineering team to send me an example of a successful run on the same cluster.
I found that it worked after I changed the Docker image and some NCCL settings.
So I recommend you check this too.

Exactly. Now I have fixed two machines, but I am facing an NCCL problem: it says there is an issue with NCCL! Should I change the version of something? Can you tell me what steps you took to fix your problem? My code is now running, but I have two errors: one of them is NCCL, and the other says rank 1 got 8 params while rank 0 got 0!

It does seem like a host configuration issue where the hosts cannot talk to each other.

One way to separate NCCL errors from PyTorch distributed errors is to create a sample script that just creates a Store; the Store is what bootstraps the process groups, and NCCL is only initialized afterwards. If the Store initializes correctly but the run still fails, then it would be an NCCL issue. However, based on the stack trace it doesn't look like an NCCL issue.
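
A minimal sketch of such a Store check, assuming the endpoint from the torchrun commands above (10.30.7.22:29603 comes from the original post; the rest is illustrative):

  import datetime
  import torch.distributed as dist

  # Run this on both nodes: is_master=True on 10.30.7.22, is_master=False on the other node.
  store = dist.TCPStore(
      host_name="10.30.7.22",  # the --rdzv_endpoint host
      port=29603,              # the --rdzv_endpoint port
      world_size=2,
      is_master=True,          # set to False on the worker node
      timeout=datetime.timedelta(seconds=60),
  )
  store.set("ping", "pong")
  print(store.get("ping"))     # prints b'pong' once both hosts are connected

If this times out the same way, the problem is host networking (the port must be reachable between the machines, e.g. not blocked by a firewall), not NCCL or the training code.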

Thank you, I fixed that problem (at least it looks like it), but now I have issues with NCCL! Everything works fine on a single machine, but when going multi-node I am getting this error:

Files already downloaded and verified
smartedge:949125:949125 [0] NCCL INFO cudaDriverVersion 12040
smartedge:949125:949125 [0] NCCL INFO Bootstrap : Using eno8303:10.30.2.58<0>
smartedge:949125:949125 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
smartedge:949125:949125 [0] NCCL INFO init.cc:1732 Cuda Host Alloc Size 4 pointer 0x7ff1c3200000
smartedge:949125:949227 [0] NCCL INFO NET/IB : Using [0]={[0] mlx5_0:1/RoCE, [1] mlx5_1:1/RoCE} [RO]; OOB eno8303:10.30.2.58<0>
smartedge:949125:949227 [0] NCCL INFO Using non-device net plugin version 0
smartedge:949125:949227 [0] NCCL INFO Using network IB
smartedge:949125:949227 [0] NCCL INFO comm 0x6c3f9a0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 12000 commId 0x4fb96858dddc545 - Init START
smartedge:949125:949227 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0+mlx5_1'
smartedge:949125:949227 [0] NCCL INFO === System : maxBw 12.0 totalBw 12.0 ===
smartedge:949125:949227 [0] NCCL INFO CPU/0 (1/1/2)
smartedge:949125:949227 [0] NCCL INFO + PCI[12.0] - PCI/D000 (15b3197800000000)
smartedge:949125:949227 [0] NCCL INFO + PCI[24.0] - NIC/F000
smartedge:949125:949227 [0] NCCL INFO + NET[25.0] - NET/0 (b0e48f00032db048/1/25.000000)
smartedge:949125:949227 [0] NCCL INFO + PCI[24.0] - PCI/10000 (15b3197800000000)
smartedge:949125:949227 [0] NCCL INFO + PCI[12.0] - GPU/12000 (1)
smartedge:949125:949227 [0] NCCL INFO ==========================================
smartedge:949125:949227 [0] NCCL INFO GPU/12000 :GPU/12000 (0/5000.000000/LOC) CPU/0 (3/12.000000/PHB) NET/0 (6/12.000000/PHB)
smartedge:949125:949227 [0] NCCL INFO NET/0 :GPU/12000 (6/12.000000/PHB) CPU/0 (3/12.000000/PHB) NET/0 (0/5000.000000/LOC)
smartedge:949125:949227 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555,55555555,55555555
smartedge:949125:949227 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 5.000000/5.000000, type LOC/PHB, sameChannels 1
smartedge:949125:949227 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
smartedge:949125:949227 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 10.000000/5.000000, type LOC/PHB, sameChannels 1
smartedge:949125:949227 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
smartedge:949125:949227 [0] NCCL INFO comm 0x6c3f9a0 rank 1 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
smartedge:949125:949227 [0] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1
smartedge:949125:949227 [0] NCCL INFO Tree 1 : -1 -> 1 -> 0/-1/-1
smartedge:949125:949227 [0] NCCL INFO Ring 00 : 0 -> 1 -> 0
smartedge:949125:949227 [0] NCCL INFO Ring 01 : 0 -> 1 -> 0
smartedge:949125:949227 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
smartedge:949125:949227 [0] NCCL INFO P2P Chunksize set to 131072
smartedge:949125:949227 [0] NCCL INFO UDS: Creating service thread comm 0x6c3f9a0 rank 1
smartedge:949125:949227 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
smartedge:949125:949227 [0] NCCL INFO channel.cc:40 Cuda Alloc Size 1152 pointer 0x7ff1a8e00000
smartedge:949125:949227 [0] NCCL INFO channel.cc:43 Cuda Alloc Size 32 pointer 0x7ff1a9000000
smartedge:949125:949227 [0] NCCL INFO channel.cc:54 Cuda Alloc Size 8 pointer 0x7ff1a9200000
smartedge:949125:949227 [0] NCCL INFO channel.cc:40 Cuda Alloc Size 1152 pointer 0x7ff1a9400000
smartedge:949125:949227 [0] NCCL INFO channel.cc:43 Cuda Alloc Size 32 pointer 0x7ff1a9600000
smartedge:949125:949227 [0] NCCL INFO channel.cc:54 Cuda Alloc Size 8 pointer 0x7ff1a9800000
smartedge:949125:949230 [0] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x7ff1a4004f10
smartedge:949125:949230 [0] NCCL INFO Allocated 5767524 bytes of shared memory in /dev/shm/nccl-8t3VWy
smartedge:949125:949230 [0] NCCL INFO New proxy recv connection 0 from local rank 0, transport 2
smartedge:949125:949230 [0] NCCL INFO proxyProgressAsync opId=0x7ff1b35ada30 op.type=1 op.reqBuff=0x7ff1a4004ed0 op.respSize=16 done
smartedge:949125:949227 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7ff1b35ada30
smartedge:949125:949227 [0] NCCL INFO resp.opId=0x7ff1b35ada30 matches expected opId=0x7ff1b35ada30
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Init res=0
smartedge:949125:949227 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7ff1a4004f30
smartedge:949125:949230 [0] NCCL INFO proxyProgressAsync opId=0x7ff1b35ada30 op.type=3 op.reqBuff=0x7ff1a4008e30 op.respSize=128 done
smartedge:949125:949227 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7ff1b35ada30
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Setup res=0
smartedge:949125:949227 [0] NCCL INFO resp.opId=0x7ff1b35ada30 matches expected opId=0x7ff1b35ada30
smartedge:949125:949227 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IB/0
smartedge:949125:949230 [0] NCCL INFO New proxy recv connection 1 from local rank 0, transport 2
smartedge:949125:949230 [0] NCCL INFO proxyProgressAsync opId=0x7ff1b35ada30 op.type=1 op.reqBuff=0x7ff1a400e280 op.respSize=16 done
smartedge:949125:949227 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7ff1b35ada30
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Init res=0
smartedge:949125:949227 [0] NCCL INFO resp.opId=0x7ff1b35ada30 matches expected opId=0x7ff1b35ada30
smartedge:949125:949227 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7ff1a4004fa8
smartedge:949125:949230 [0] NCCL INFO proxyProgressAsync opId=0x7ff1b35ada30 op.type=3 op.reqBuff=0x7ff1a400e2c0 op.respSize=128 done
smartedge:949125:949227 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7ff1b35ada30
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Setup res=0
smartedge:949125:949227 [0] NCCL INFO resp.opId=0x7ff1b35ada30 matches expected opId=0x7ff1b35ada30
smartedge:949125:949227 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IB/0
smartedge:949125:949230 [0] NCCL INFO New proxy send connection 2 from local rank 0, transport 2
smartedge:949125:949230 [0] NCCL INFO proxyProgressAsync opId=0x7ff1b35ada30 op.type=1 op.reqBuff=0x7ff1a4013710 op.respSize=16 done
smartedge:949125:949227 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7ff1b35ada30
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Init res=0
smartedge:949125:949227 [0] NCCL INFO resp.opId=0x7ff1b35ada30 matches expected opId=0x7ff1b35ada30
smartedge:949125:949227 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7ff1a4005020
smartedge:949125:949230 [0] NCCL INFO proxyProgressAsync opId=0x7ff1b35ada30 op.type=3 op.reqBuff=0x7ff1a4013750 op.respSize=0 done
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Setup res=0
smartedge:949125:949227 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7ff1b35ada30
smartedge:949125:949227 [0] NCCL INFO resp.opId=0x7ff1b35ada30 matches expected opId=0x7ff1b35ada30
smartedge:949125:949227 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IB/0
smartedge:949125:949230 [0] NCCL INFO New proxy send connection 3 from local rank 0, transport 2
smartedge:949125:949230 [0] NCCL INFO proxyProgressAsync opId=0x7ff1b35ada30 op.type=1 op.reqBuff=0x7ff1a4018a70 op.respSize=16 done
smartedge:949125:949227 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7ff1b35ada30
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Init res=0
smartedge:949125:949227 [0] NCCL INFO resp.opId=0x7ff1b35ada30 matches expected opId=0x7ff1b35ada30
smartedge:949125:949227 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7ff1a4005098
smartedge:949125:949230 [0] NCCL INFO proxyProgressAsync opId=0x7ff1b35ada30 op.type=3 op.reqBuff=0x7ff1a4018ab0 op.respSize=0 done
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Setup res=0
smartedge:949125:949227 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7ff1b35ada30
smartedge:949125:949227 [0] NCCL INFO resp.opId=0x7ff1b35ada30 matches expected opId=0x7ff1b35ada30
smartedge:949125:949227 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IB/0
smartedge:949125:949227 [0] NCCL INFO sendConnect ncclProxyCallAsync opId=0x7ff1b379c000
smartedge:949125:949227 [0] NCCL INFO recvConnect ncclProxyCallAsync opId=0x7ff1b379c190 &recv->proxyConn=0x7ff1b379c198 connectInfo=0x7ff1b39d3e40
smartedge:949125:949227 [0] NCCL INFO sendConnect ncclProxyCallAsync opId=0x7ff1b39ce630
smartedge:949125:949227 [0] NCCL INFO recvConnect ncclProxyCallAsync opId=0x7ff1b39ce7c0 &recv->proxyConn=0x7ff1b39ce7c8 connectInfo=0x7ff1b39d3ec0
smartedge:949125:949230 [0] NCCL INFO transport/net_ib.cc:773 Ib Alloc Size 181536 pointer 0x7ff1a4024000
smartedge:949125:949230 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 252 mtu 5 query_ece={supported=1, vendor_id=0x15b3, options=0x30000002, comp_mask=0x0} GID 0 (80FE/B0E48FFEFF2DB04A) fifoRkey=0x4d999 fifoLkey=0x4d999
smartedge:949125:949230 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 1 Port 1 qpn 496 mtu 5 query_ece={supported=1, vendor_id=0x15b3, options=0x30000002, comp_mask=0x0} GID 0 (80FE/B1E48FFEFF2DB04A) fifoRkey=0x87f7b fifoLkey=0x87f7b
smartedge:949125:949230 [0] NCCL INFO transport/net_ib.cc:867 Ib Alloc Size 3336 pointer 0x7ff1a40dc000
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Connect res=0
smartedge:949125:949230 [0] NCCL INFO transport/net_ib.cc:978 Ib Alloc Size 164416 pointer 0x7ff1a40e3000
smartedge:949125:949230 [0] NCCL INFO transport/net_ib.cc:991 Ib Alloc Size 3336 pointer 0x7ff1a410d000

smartedge:949125:949230 [0] transport/net_ib.cc:1017 NCCL WARN NET/IB : Local mergedDev mlx5_0+mlx5_1 has a different number of devices=2 as remote rocep13s0f0 1
smartedge:949125:949230 [0] NCCL INFO transport/net_ib.cc:1132 Ib Alloc Size 3336 pointer 0x7ff1a41d9000
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Connect res=0

smartedge:949125:949230 [0] transport/net_ib.cc:891 NCCL WARN NET/IB : Local mergedDev=mlx5_0+mlx5_1 has a different number of devices=2 as remoteDev=rocep13s0f0 nRemDevs=1
smartedge:949125:949230 [0] NCCL INFO Call to ibv_set_ece failed with error Operation not supported errno 95

smartedge:949125:949230 [0] misc/ibvwrap.cc:206 NCCL WARN Call to ibv_modify_qp failed with error Invalid argument errno 22
smartedge:949125:949230 [0] NCCL INFO transport/net_ib.cc:725 -> 2
smartedge:949125:949230 [0] NCCL INFO transport/net_ib.cc:931 -> 2
smartedge:949125:949230 [0] NCCL INFO transport/net.cc:683 -> 2
smartedge:949125:949230 [0] NCCL INFO proxyProgressAsync opId=0x7ff1b379c000 op.type=4 op.reqBuff=0x7ff1a401ddd0 op.respSize=21040 done
smartedge:949125:949227 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7ff1b379c000
smartedge:949125:949227 [0] NCCL INFO Queuing opId=0x7ff1b379c000 respBuff=0x7ff1b39ff070 respSize=21040
smartedge:949125:949227 [0] NCCL INFO ncclPollProxyResponse Dequeued cached opId=0x7ff1b379c000
smartedge:949125:949227 [0] NCCL INFO transport/net.cc:304 -> 2
smartedge:949125:949227 [0] NCCL INFO transport.cc:165 -> 2
smartedge:949125:949227 [0] NCCL INFO init.cc:1222 -> 2
smartedge:949125:949227 [0] NCCL INFO init.cc:1501 -> 2
smartedge:949125:949227 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
smartedge:949125:949125 [0] NCCL INFO group.cc:418 -> 2
smartedge:949125:949125 [0] NCCL INFO group.cc:95 -> 2
smartedge:949125:949230 [0] NCCL INFO transport/net_ib.cc:773 Ib Alloc Size 181536 pointer 0x7ff1a41db000
smartedge:949125:949230 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 255 mtu 5 query_ece={supported=1, vendor_id=0x15b3, options=0x30000002, comp_mask=0x0} GID 0 (80FE/B0E48FFEFF2DB04A) fifoRkey=0x4efae fifoLkey=0x4efae
smartedge:949125:949230 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 1 Port 1 qpn 499 mtu 5 query_ece={supported=1, vendor_id=0x15b3, options=0x30000002, comp_mask=0x0} GID 0 (80FE/B1E48FFEFF2DB04A) fifoRkey=0x89897 fifoLkey=0x89897
smartedge:949125:949230 [0] NCCL INFO transport/net_ib.cc:867 Ib Alloc Size 3336 pointer 0x7ff1a427f000
smartedge:949125:949230 [0] NCCL INFO proxy.cc:1425 -> 3
smartedge:949125:949230 [0] NCCL INFO Received and initiated operation=Connect res=3

smartedge:949125:949230 [0] proxy.cc:1567 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
[rank1]: Traceback (most recent call last):
[rank1]: File "ddp-cifar100-multinode.py", line 142, in <module>
[rank1]: main(args.epochs, args.batch_size)
[rank1]: File "ddp-cifar100-multinode.py", line 128, in main
[rank1]: trainer = Trainer(model, train_data, optimizer, save_every)
[rank1]: File "ddp-cifar100-multinode.py", line 59, in __init__
[rank1]: self.model = DDP(model, device_ids=[self.local_rank])
[rank1]: File "/home/adminsssa/env1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
[rank1]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank1]: File "/home/adminsssa/env1/lib/python3.8/site-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
[rank1]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank1]: torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: Call to ibv_modify_qp failed with error Invalid argument errno 22
smartedge:949125:949231 [0] NCCL INFO [Proxy Service UDS] exit: stop 1 abortFlag 1
smartedge:949125:949233 [0] NCCL INFO comm 0x6c3f9a0 rank 1 nranks 2 cudaDev 0 busId 12000 - Abort COMPLETE
E0602 19:37:41.145897 140028373555008 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 949125) of binary: /home/adminsssa/env1/bin/python3
Traceback (most recent call last):
File "/home/adminsssa/env1/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/adminsssa/env1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/adminsssa/env1/lib/python3.8/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/adminsssa/env1/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/adminsssa/env1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/adminsssa/env1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

ddp-cifar100-multinode.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-06-02_19:37:41
host : smartedge
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 949125)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

smartedge:949125:949230 [0] NCCL INFO Call to ibv_set_ece failed with error Operation not supported errno 95

smartedge:949125:949230 [0] misc/ibvwrap.cc:206 NCCL WARN Call to ibv_modify_qp failed with error Invalid argument errno 22

These errors look concerning, but I don't know what the resolution would be. It looks like something with InfiniBand (ibv are InfiniBand verbs), so look into how that is set up, or try disabling InfiniBand to see if it is the culprit (a sketch of the relevant NCCL settings is below). NCCL experts may know more, so you could also open an issue in that repository (Issues · NVIDIA/nccl · GitHub).
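
For example, a minimal sketch of taking NCCL off the InfiniBand/RoCE path. NCCL_IB_DISABLE, NCCL_SOCKET_IFNAME, and NCCL_DEBUG are standard NCCL environment variables, but the interface name is just the one from the log above and must be adjusted per host; they can equally be exported in the shell before running torchrun:

  import os

  # Must be set before the process group (and therefore NCCL) is initialized.
  os.environ["NCCL_IB_DISABLE"] = "1"           # skip InfiniBand/RoCE verbs, use TCP sockets
  os.environ["NCCL_SOCKET_IFNAME"] = "eno8303"  # NIC from the log above; adjust per host
  os.environ["NCCL_DEBUG"] = "INFO"             # keep verbose logs while debugging

  import torch.distributed as dist
  dist.init_process_group(backend="nccl")

If the run succeeds with NCCL_IB_DISABLE=1, the RoCE/verbs setup (the ibv_modify_qp failure above) is the likely culprit.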

I tried disabling InfiniBand as well, but it still didn't work; I even tried setting a specific interface that I was sure was directly connected, but still the same problem! I updated everything and all the versions are the same on both servers, but I am a bit lost about what I can do to fix this NCCL problem.

Hi, I ran into the same problem. Did you ever solve it?

Unfortunately that problem was never solved directly; it was probably an issue with the overall server environments, which I could not make exactly identical since they were physical servers. Instead I found a workaround for the connection issue: since the two servers had trouble recognizing each other, I simply used Docker Swarm to create two nodes and set up an overlay network between them (one is created by default, but you can customize it). After this change, the two servers communicate with each other without any problem! I hope this helps you.