Distributed Data Parallel single node maximum number of GPUs

The DistributedDataParallel (DDP) ImageNet training example fails when run on a single node with 10 GPUs, throwing the following error: RuntimeError: NCCL error in: /tmp/pip-req-build-4baxydiv/torch/lib/c10d/ProcessGroupNCCL.cpp:400, unhandled cuda error. The same script runs perfectly fine as soon as the number of GPUs in the environment is set to 8. For DataParallel, somewhere it is mentioned that at present it does not run on more than 8 GPUs; however, I could not find similar info about DDP (I may have missed it). Moreover, as all the processes load their own module locally on each of the devices without a broadcast during initialization, is not it unexpected?

Which NCCL version are you using?
Could you rerun your script with NCCL_DEBUG=DEBUG python ... and post the log here?

Thanks for the response.
That flag does not generate much info other than the usual output. I reran the code with NCCL_DEBUG=INFO, below is the log (machine is named gpu123):

$ NCCL_DEBUG=INFO python main.py -a resnet18 --dist-url 'tcp://127.0.0.1:6840' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --local_rank 0 ~/tiny-imagenet-200
Use GPU: 3 for training
Use GPU: 5 for training
Use GPU: 7 for training
Use GPU: 9 for training
Use GPU: 6 for training
Use GPU: 1 for training
Use GPU: 0 for training
Use GPU: 4 for training
=> creating model 'resnet18'
Use GPU: 8 for training
=> creating model 'resnet18'
Use GPU: 2 for training
=> creating model 'resnet18'
=> creating model 'resnet18'
=> creating model 'resnet18'
=> creating model 'resnet18'
=> creating model 'resnet18'
=> creating model 'resnet18'
=> creating model 'resnet18'
=> creating model 'resnet18'
gpu123:7932:7932 [0] NCCL INFO Bootstrap : Using [0]ib0:10.36.192.223<0>
gpu123:7932:7932 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu123:7932:7932 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.36.192.223<0>
NCCL version 2.4.8+cuda10.0
gpu123:7941:7941 [9] NCCL INFO Bootstrap : Using [0]ib0:10.36.192.223<0>
gpu123:7936:7936 [4] NCCL INFO Bootstrap : Using [0]ib0:10.36.192.223<0>
gpu123:7934:7934 [2] NCCL INFO Bootstrap : Using [0]ib0:10.36.192.223<0>
gpu123:7933:7933 [1] NCCL INFO Bootstrap : Using [0]ib0:10.36.192.223<0>
gpu123:7937:7937 [5] NCCL INFO Bootstrap : Using [0]ib0:10.36.192.223<0>
gpu123:7935:7935 [3] NCCL INFO Bootstrap : Using [0]ib0:10.36.192.223<0>
gpu123:7941:7941 [9] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu123:7936:7936 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu123:7934:7934 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu123:7933:7933 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu123:7937:7937 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu123:7935:7935 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu123:7940:7940 [8] NCCL INFO Bootstrap : Using [0]ib0:10.36.192.223<0>
gpu123:7940:7940 [8] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu123:7941:7941 [9] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.36.192.223<0>
gpu123:7935:7935 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.36.192.223<0>
gpu123:7933:7933 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.36.192.223<0>
gpu123:7936:7936 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.36.192.223<0>
gpu123:7934:7934 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.36.192.223<0>
gpu123:7937:7937 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.36.192.223<0>
gpu123:7940:7940 [8] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.36.192.223<0>
gpu123:7932:8011 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
gpu123:7941:8013 [9] NCCL INFO Setting affinity for GPU 9 to 0fff
gpu123:7940:8025 [8] NCCL INFO Setting affinity for GPU 8 to 0fff
gpu123:7934:8022 [2] NCCL INFO Setting affinity for GPU 2 to 0fff
gpu123:7937:8023 [5] NCCL INFO Setting affinity for GPU 5 to 0fff
gpu123:7935:8017 [3] NCCL INFO Setting affinity for GPU 3 to 0fff
gpu123:7936:8020 [4] NCCL INFO Setting affinity for GPU 4 to 0fff
gpu123:7933:8018 [1] NCCL INFO Setting affinity for GPU 1 to 0fff
gpu123:7939:7939 [7] NCCL INFO Bootstrap : Using [0]ib0:10.36.192.223<0>
gpu123:7939:7939 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu123:7939:7939 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.36.192.223<0>
gpu123:7938:7938 [6] NCCL INFO Bootstrap : Using [0]ib0:10.36.192.223<0>
gpu123:7938:7938 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu123:7938:7938 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.36.192.223<0>
gpu123:7939:8027 [7] NCCL INFO Setting affinity for GPU 7 to 0fff
gpu123:7938:8029 [6] NCCL INFO Setting affinity for GPU 6 to 0fff
gpu123:7932:8011 [0] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9
gpu123:7935:8017 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
gpu123:7938:8029 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/IPC
gpu123:7936:8020 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
gpu123:7937:8023 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/IPC
gpu123:7939:8027 [7] NCCL INFO Ring 00 : 7[7] -> 8[8] via P2P/IPC
gpu123:7941:8013 [9] NCCL INFO Ring 00 : 9[9] -> 0[0] via P2P/IPC
gpu123:7940:8025 [8] NCCL INFO Ring 00 : 8[8] -> 9[9] via P2P/IPC
gpu123:7933:8018 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
gpu123:7932:8011 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
gpu123:7934:8022 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
gpu123:7941:8013 [9] transport/p2p.cc:574 NCCL WARN failed to open CUDA IPC handle : 60 peer mapping resources exhausted
gpu123:7941:8013 [9] NCCL INFO init.cc:669 -> 1
gpu123:7941:8013 [9] NCCL INFO init.cc:815 -> 1
gpu123:7941:8013 [9] NCCL INFO init.cc:951 -> 1
gpu123:7941:8013 [9] NCCL INFO misc/group.cc:69 -> 1 [Async thread]
gpu123:7932:8011 [0] transport/p2p.cc:604 NCCL WARN failed to open CUDA IPC handle : 60 peer mapping resources exhausted
gpu123:7932:8011 [0] NCCL INFO init.cc:679 -> 1
gpu123:7932:8011 [0] NCCL INFO init.cc:815 -> 1
gpu123:7932:8011 [0] NCCL INFO init.cc:951 -> 1
gpu123:7932:8011 [0] NCCL INFO misc/group.cc:69 -> 1 [Async thread]
gpu123:7935:8017 [3] NCCL INFO comm 0x2abf58001e10 rank 3 nranks 10 cudaDev 3 nvmlDev 3 - Init COMPLETE
gpu123:7934:8022 [2] NCCL INFO comm 0x2b8408001e10 rank 2 nranks 10 cudaDev 2 nvmlDev 2 - Init COMPLETE
gpu123:7937:8023 [5] NCCL INFO comm 0x2b011c001e10 rank 5 nranks 10 cudaDev 5 nvmlDev 5 - Init COMPLETE
gpu123:7936:8020 [4] NCCL INFO comm 0x2b4b28001e10 rank 4 nranks 10 cudaDev 4 nvmlDev 4 - Init COMPLETE
gpu123:7938:8029 [6] NCCL INFO comm 0x2b58d8001e10 rank 6 nranks 10 cudaDev 6 nvmlDev 6 - Init COMPLETE
gpu123:7933:8018 [1] NCCL INFO comm 0x2af070001e10 rank 1 nranks 10 cudaDev 1 nvmlDev 1 - Init COMPLETE
gpu123:7939:8027 [7] NCCL INFO comm 0x2b2180001e10 rank 7 nranks 10 cudaDev 7 nvmlDev 7 - Init COMPLETE
gpu123:7940:8025 [8] NCCL INFO comm 0x2b015c001e10 rank 8 nranks 10 cudaDev 8 nvmlDev 8 - Init COMPLETE
/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "main.py", line 425, in <module>
    main()
  File "main.py", line 109, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 9 terminated with the following error:
Traceback (most recent call last):
  File "/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/nfs/scistore08/alistgrp/bchatter/workspace/async-opt/dist_data_parallel/imagenet_training/main.py", line 151, in main_worker
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
  File "/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 298, in __init__
    self.broadcast_bucket_size)
  File "/nfs/scistore08/alistgrp/bchatter/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 480, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /tmp/pip-req-build-4baxydiv/torch/lib/c10d/ProcessGroupNCCL.cpp:400, unhandled cuda error
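With ten processes writing interleaved output, the informative lines are easy to miss. In this log, only devices 0 and 9 report "failed to open CUDA IPC handle : 60 peer mapping resources exhausted", while devices 1 through 8 reach "Init COMPLETE". A minimal sketch of a filter that pulls this out of NCCL_DEBUG=INFO output follows; the helper name summarize_nccl_log and the line pattern are assumptions based on the format seen in this log, not part of NCCL or PyTorch.

```python
import re
from collections import defaultdict

# Prefix seen in this log: host:pid:tid [cudaDev] message
# (pattern inferred from the output above; other NCCL versions may differ)
LINE_RE = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] (?P<rest>.*)$"
)


def summarize_nccl_log(text):
    """Return (warnings_by_device, devices_that_completed_init)."""
    warnings = defaultdict(list)
    completed = set()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an NCCL-prefixed line, e.g. Python output
        dev = int(match.group("dev"))
        rest = match.group("rest")
        if "NCCL WARN" in rest:
            # Keep only the message that follows the "NCCL WARN" tag
            warnings[dev].append(rest.split("NCCL WARN", 1)[1].strip())
        elif rest.endswith("Init COMPLETE"):
            completed.add(dev)
    return dict(warnings), completed


# Three lines copied from the log in this post
sample = """\
gpu123:7941:8013 [9] transport/p2p.cc:574 NCCL WARN failed to open CUDA IPC handle : 60 peer mapping resources exhausted
gpu123:7932:8011 [0] transport/p2p.cc:604 NCCL WARN failed to open CUDA IPC handle : 60 peer mapping resources exhausted
gpu123:7935:8017 [3] NCCL INFO comm 0x2abf58001e10 rank 3 nranks 10 cudaDev 3 nvmlDev 3 - Init COMPLETE
"""

warnings, completed = summarize_nccl_log(sample)
print(sorted(warnings))   # [0, 9]
print(sorted(completed))  # [3]
```

The filter expects one log record per line, so output captured to a file works better than text pasted through a web form.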

torch.cuda.nccl.version() outputs 2408.
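The integer 2408 corresponds to the "NCCL version 2.4.8+cuda10.0" banner in the log. A small sketch of the decoding, assuming the encoding major*1000 + minor*100 + patch that matches this pair of values (the helper name decode_nccl_version is hypothetical; newer NCCL releases use a different encoding, and newer PyTorch releases may return a tuple instead of an integer):

```python
def decode_nccl_version(code):
    """Split an integer NCCL version code into (major, minor, patch).

    Assumes code == major*1000 + minor*100 + patch, which is consistent
    with 2408 <-> 2.4.8 in this thread; not valid for newer encodings.
    """
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


print(decode_nccl_version(2408))  # (2, 4, 8)
```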

Hey @bapi

For DataParallel, somewhere it is mentioned that at present it does not run on more than 8 GPUs;

Just curious, could you please point me to the doc with this claim? This is new to me; I wasn't aware of such a limitation in DP.

however, I could not find similar info about DDP (I may have missed it).

We recently tested DDP using 256 GPUs, and it runs fine. Could this error be something specific to the ImageNet example? cc @fmassa for vision questions

Moreover, as all the processes load their own module locally on each of the devices without a broadcast during initialization, is not it unexpected?

There is a broadcast in DDP ctor. Please see the link below:

Hi @mrshenli, (sorry for such a late read/response to this message)

So, I figured out that using either the flag NCCL_P2P_LEVEL=0 or NCCL_P2P_DISABLE=1, DDP runs fine on a machine with >8 GPUs. Here I am specifically talking about 10 GPUs in the same machine. I am not sure about the topology of the 256 GPUs that you mentioned.
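For anyone who wants to apply this from a launcher script rather than the shell, here is a minimal sketch. NCCL reads these variables when it initializes, so they must be in the environment of the training processes before the process group is created. The helper names nccl_env and launch are hypothetical; NCCL_P2P_DISABLE and NCCL_DEBUG are the variables mentioned in this thread. Disabling peer-to-peer transport likely costs some GPU-to-GPU bandwidth, so treat it as a workaround rather than a tuned setting.

```python
import os
import subprocess
import sys


def nccl_env(disable_p2p=True, debug="INFO", base=None):
    """Return a copy of the environment with the NCCL settings applied."""
    env = dict(os.environ if base is None else base)
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"  # workaround reported in this thread
    if debug:
        env["NCCL_DEBUG"] = debug      # keep NCCL logging on while testing
    return env


def launch(cmd, **kwargs):
    """Run a command (list of argv strings) with the NCCL settings in its environment."""
    return subprocess.run(cmd, env=nccl_env(**kwargs), check=False)


if __name__ == "__main__":
    # Stand-in child process that only echoes the variable;
    # replace it with the real training command.
    launch([sys.executable, "-c",
            "import os; print(os.environ['NCCL_P2P_DISABLE'])"])
```

Child processes started with mp.spawn inherit the parent's environment, so setting the variables once for the top-level script should be enough.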

Right now, I cannot locate the doc page where I had seen that nn.DataParallel does not run (efficiently or something?) on more than 8 GPUs in a machine. Maybe it was in a previous version of the docs? In any case, I will confirm this by testing it on a machine with 10 GPUs that I have access to.

Thanks.

In that test, each node only has 8 GPUs.

So, I figured out that using either the flag NCCL_P2P_LEVEL=0 or NCCL_P2P_DISABLE=1, DDP runs fine on a machine with >8 GPUs.

I see. We don't have tests covering cases with more than 8 GPUs per node yet. This is important information, thanks for sharing!
