DDP hangs upon creation

Hi.
I’m trying to use DDP on two nodes, but the DDP creation hangs forever. The code is like this:

import torch
import torch.nn as nn
import torch.distributed as dist
import os
from torch.nn.parallel import DistributedDataParallel as DDP
import datetime

os.environ['MASTER_ADDR'] = '$myip'  # placeholder for the master node's actual IP
os.environ['MASTER_PORT'] = '7777'
# os.environ['NCCL_BLOCKING_WAIT'] = '1'
os.environ['NCCL_ASYNC_ERROR_HANDLING'] = '1'


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

The following lines are different for each node:

dist.init_process_group(backend='nccl', timeout=datetime.timedelta(0, 10), world_size=2, rank=0) # rank=0 for $myip node, rank=1 for the other node

model = ToyModel().to(0)
ddp_model = DDP(model, device_ids=[0], output_device=0) # This is where it hangs.

One of the nodes would show this:

In [4]: model = ToyModel().to(0)
   ...: ddp_model = DDP(model, device_ids=[0], output_device=0)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-7fbd4245ff44> in <module>
      1 model = ToyModel().to(0)
----> 2 ddp_model = DDP(model, device_ids=[0], output_device=0)

~/bin/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py in __init__(self, module, device_ids, output_device, dim, broadcast_buffers, process_group, bucket_cap_mb, find_unused_parameters, check_reduction, gradient_as_bucket_view)
    576         parameters, expect_sparse_gradient = self._build_params_for_reducer()
    577         # Verify model equivalence.
--> 578         dist._verify_model_across_ranks(self.process_group, parameters)
    579         # Sync params and buffers. Ensures all DDP models start off at the same value.
    580         self._sync_params_and_buffers(authoritative_rank=0)

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1634272172048/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Any advice? Thanks~

Hey @Musoy_King, it looks like the NCCL broadcast crashed. Can you check whether calling dist.broadcast directly fails as well?
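
For reference, a minimal broadcast check might look like this (a sketch only, assuming init_process_group has already succeeded with the same settings as above and each node uses GPU 0):

# minimal NCCL connectivity check, run on both nodes after init_process_group
t = torch.zeros(1).to(0)             # tensor on GPU 0 of each node
if dist.get_rank() == 0:
    t += 42                          # rank 0 holds the value to broadcast
dist.broadcast(t, src=0)             # both ranks should end up with 42
print(f"rank {dist.get_rank()}: {t.item()}")

If this simple broadcast also fails, the problem is in the NCCL/network setup rather than in DDP itself.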

Also, it looks like you are using IPython or a notebook. Can you try running the script directly with python on both nodes?
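
For example, something along these lines (the file name, the argument handling, and the stand-in model are just illustrative; your real script would use the ToyModel above):

# toy_ddp.py -- hypothetical standalone version of the snippet above
# run as:  python toy_ddp.py 0   on the $myip node
#          python toy_ddp.py 1   on the other node
import datetime
import os
import sys

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = int(sys.argv[1])                      # 0 on the master node, 1 on the other
os.environ['MASTER_ADDR'] = '$myip'          # replace with the master node's IP
os.environ['MASTER_PORT'] = '7777'

dist.init_process_group(backend='nccl',
                        timeout=datetime.timedelta(0, 10),
                        world_size=2, rank=rank)

model = nn.Linear(10, 5).to(0)               # stand-in for ToyModel
ddp_model = DDP(model, device_ids=[0], output_device=0)
print(f"rank {rank}: DDP constructed")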

Hi, I've run into the exact same problem.
Can you share how you dealt with this error?

Can you run with NCCL_DEBUG=INFO and share the logs? That would provide more detailed information about what went wrong.
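
For example, you can set it in the environment before init_process_group (or export it in the shell before launching); the NCCL_SOCKET_IFNAME line is only an example of a related knob, and the interface name is an assumption:

# enable verbose NCCL logging before init_process_group; INFO shows which
# interfaces and transports NCCL tries, which usually pinpoints the failing
# socket call behind ncclSystemError
os.environ['NCCL_DEBUG'] = 'INFO'
# if the nodes have several network interfaces, pinning NCCL to the right one
# can also help (replace 'eth0' with your actual interface name)
# os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'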