Problem in DataParallel for more than 2 GPUs

I had a working model. Its results were very consistent regardless of the number of GPUs: it gave similar results when running on 1-4 GPUs (all combinations had been tested).
Now I've come across a weird issue. Everything works fine on one or two GPUs (wrapped with "model = nn.DataParallel(model, device_ids=devices)").
If I try to use 3 GPUs it crashes with NCCL Error 2: unhandled system error.
Moreover, if I use 4 GPUs it just gets stuck, and ctrl+c doesn't help of course (luckily I'm working via terminal and ssh, so I just break the pipe to kill the run).
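For reference, the wrapping is essentially the following (a minimal sketch with a toy model, not my actual network):

import torch
import torch.nn as nn

devices = [0, 1, 2]  # the 3-GPU case that crashes
# toy stand-in for the real model; it has to live on device_ids[0] first
model = nn.Linear(128, 64).to(f"cuda:{devices[0]}")
model = nn.DataParallel(model, device_ids=devices)

batch = torch.randn(48, 128, device=f"cuda:{devices[0]}")
outputs = model(batch)  # replicate() broadcasts the parameters to all devices here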

Could you rerun the 3 GPU run via NCCL_DEBUG=INFO python script.py args and see if NCCL reports any errors?
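(Setting the variable from inside the script also works, as long as it happens before the first NCCL call; a minimal sketch:)

import os
os.environ["NCCL_DEBUG"] = "INFO"  # must be set before the first CUDA collective runs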

The first line is my own print statement:

Initiating epoch 0
5f8f287430a2:27239:27239 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.2<0>
5f8f287430a2:27239:27239 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

5f8f287430a2:27239:27239 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
5f8f287430a2:27239:27239 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5f8f287430a2:27239:27239 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
5f8f287430a2:27239:27499 [0] NCCL INFO Channel 00/02 : 0 1 2
5f8f287430a2:27239:27500 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/64
5f8f287430a2:27239:27501 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/64
5f8f287430a2:27239:27499 [0] NCCL INFO Channel 01/02 : 0 1 2
5f8f287430a2:27239:27500 [1] NCCL INFO Trees [0] 2/-1/-1->1->0|0->1->2/-1/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
5f8f287430a2:27239:27501 [2] NCCL INFO Trees [0] -1/-1/-1->2->1|1->2->-1/-1/-1 [1] -1/-1/-1->2->1|1->2->-1/-1/-1
5f8f287430a2:27239:27500 [1] NCCL INFO Setting affinity for GPU 1 to ffff,0000ffff
5f8f287430a2:27239:27501 [2] NCCL INFO Setting affinity for GPU 3 to ffff0000,ffff0000
5f8f287430a2:27239:27499 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/64
5f8f287430a2:27239:27499 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
5f8f287430a2:27239:27499 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
5f8f287430a2:27239:27500 [1] NCCL INFO Could not enable P2P between dev 1(=5e000) and dev 0(=3b000)

5f8f287430a2:27239:27499 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device

5f8f287430a2:27239:27500 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device

5f8f287430a2:27239:27501 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
5f8f287430a2:27239:27499 [0] NCCL INFO include/shm.h:41 -> 2
5f8f287430a2:27239:27500 [1] NCCL INFO include/shm.h:41 -> 2
5f8f287430a2:27239:27501 [2] NCCL INFO include/shm.h:41 -> 2

5f8f287430a2:27239:27499 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-ea821c4e67e70b28-0-2-0 (size 9637888)

5f8f287430a2:27239:27500 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-ea821c4e67e70b28-0-0-1 (size 9637888)

5f8f287430a2:27239:27499 [0] NCCL INFO transport/shm.cc:101 -> 2
5f8f287430a2:27239:27500 [1] NCCL INFO transport/shm.cc:101 -> 2

5f8f287430a2:27239:27501 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-ea821c4e67e70b28-0-1-2 (size 9637888)

5f8f287430a2:27239:27499 [0] NCCL INFO transport.cc:30 -> 2
5f8f287430a2:27239:27501 [2] NCCL INFO transport/shm.cc:101 -> 2
5f8f287430a2:27239:27500 [1] NCCL INFO transport.cc:30 -> 2
5f8f287430a2:27239:27501 [2] NCCL INFO transport.cc:30 -> 2
5f8f287430a2:27239:27499 [0] NCCL INFO transport.cc:49 -> 2
5f8f287430a2:27239:27500 [1] NCCL INFO transport.cc:49 -> 2
5f8f287430a2:27239:27501 [2] NCCL INFO transport.cc:49 -> 2
5f8f287430a2:27239:27499 [0] NCCL INFO init.cc:766 -> 2
5f8f287430a2:27239:27501 [2] NCCL INFO init.cc:766 -> 2
5f8f287430a2:27239:27500 [1] NCCL INFO init.cc:766 -> 2
5f8f287430a2:27239:27501 [2] NCCL INFO init.cc:840 -> 2
5f8f287430a2:27239:27500 [1] NCCL INFO init.cc:840 -> 2
5f8f287430a2:27239:27499 [0] NCCL INFO init.cc:840 -> 2
5f8f287430a2:27239:27501 [2] NCCL INFO group.cc:73 -> 2 [Async thread]
5f8f287430a2:27239:27500 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
5f8f287430a2:27239:27499 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
5f8f287430a2:27239:27239 [0] NCCL INFO init.cc:906 -> 2
Traceback (most recent call last):
  File "main_comb.py", line 75, in <module>
    [trloss,tstloss,tstevm,inf_time]=cnn.run_epoch(epoch,show=True)
  File "/common_space_docker/storage_4TSSD/ben/ce/Results/Nov_19_20_11:02_3DFF_base/cnn_imp.py", line 875, in run_epoch
    train_loss = train_model()
  File "/common_space_docker/storage_4TSSD/ben/ce/Results/Nov_19_20_11:02_3DFF_base/cnn_imp.py", line 762, in train_model
    outputs = model(channel)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 160, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 165, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 88, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/_functions.py", line 22, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/comm.py", line 56, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error

These are 3 TITAN GPUs with more than enough memory (a batch plus the model take about 4 GB of the 25 GB available on each TITAN).

The error seems to come from NCCL not being able to use shared memory:

NCCL WARN Call to posix_fallocate failed : No space left on device
[...]
NCCL WARN Error while creating shared memory segment nccl-shm-recv-ea821c4e67e70b28-0-2-0 (size 9637888)

So make sure your machine has enough shared memory, and if you are using a docker container, use the --ipc=host argument (or specify the shared memory size manually).
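A quick way to confirm the limit from inside the container is to check /dev/shm, which is where NCCL creates its nccl-shm-* segments (Linux-only sketch):

import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")
# Docker's default shared-memory size is 64 MiB, which is easily exhausted by the
# ~9.6 MB per-connection segments shown in the log; --ipc=host or --shm-size raises it.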


I am using a docker container on a server, but the server has 750 GB of RAM, which means there is over 500 GB free after loading the datasets.
Via nvidia-smi I can see that all GPUs receive their data, and the run fails as soon as the last GPU gets its batch.
Also, I usually see that after DataParallel every GPU holds a replica of the model, whereas currently only the master GPU gets the model and the others receive it together with the batch (if that helps).

The error is raised by insufficient shared memory, not system RAM. Since you are using a container, did you try the suggested argument?

@ptrblck you're some kind of magician!
The --ipc=host flag did the trick; everything now works with 1-4 devices!


I am having a very similar issue using nn.DataParallel. I am working with 3 Tesla V100 GPUs. When I use two GPUs the code runs perfectly, but when I use all three GPUs it fails during the forward pass. There is no error information given; the only message I get is "Killed". Could you help me with this one, please? I am not using Docker.

"Killed" usually indicates that the OS has killed the process, e.g. due to insufficient host memory (RAM).
Could you check if this could be the case for the 3-GPU run, e.g. are you loading more data into RAM in this use case?
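A rough way to watch this while the run starts is to read /proc/meminfo (Linux-only sketch):

def available_ram_gb() -> float:
    # MemAvailable is reported in kB
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 2**20
    raise RuntimeError("MemAvailable not found")

print(f"Available host RAM: {available_ram_gb():.1f} GB")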

@ptrblck Actually I have plenty of RAM to spare (more than 300 GB free), so that cannot be the case. I also tried batch size = 1, and it still didn't work for the 3-GPU run.

Another thing I noticed: when I set num_workers=0 in the DataLoader, the 3-GPU run works perfectly. But num_workers=0 increases the training time drastically, so I'd rather avoid it. Is there an alternative fix, i.e. a way to use 3 GPUs while keeping num_workers > 0?
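For context, the loader is set up roughly like this (toy dataset shown, not my real one):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset,
                    batch_size=32,
                    shuffle=True,
                    num_workers=4,   # any value > 0 triggers the failure on 3 GPUs
                    pin_memory=True)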

OK, in that case try to get a stack trace by running the script in gdb via:

gdb --args python script.py args
...
run
...
bt

which should hopefully show what’s failing.

I did. It is actually "RuntimeError: DataLoader worker (pid 27772) is killed by signal: Killed." Any fix for this?

Switching to PyTorch 1.7.0 solved the issue.