PyTorch isn't working with DistributedDataParallel for multi-GPU training

i_m_a_q · February 18, 2023, 1:03pm

I am trying to train my model on multiple GPUs. I imported the libraries and added the following code for it:

import os

from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group

Initialization

def ddp_setup(rank: int, world_size: int):
   os.environ["MASTER_ADDR"] = "localhost"
   os.environ["MASTER_PORT"] = "12355"
   os.environ["TORCH_DISTRIBUTED_DEBUG"]="DETAIL"
   init_process_group(backend="gloo", rank=0, world_size=1)

my model

model = CMGCNnet(config,
                 que_vocabulary=glovevocabulary,
                 glove=glove,
                 device=device)

model = model.to(0)

if -1 not in args.gpu_ids and len(args.gpu_ids) > 1:
    model = DDP(model, device_ids=[0, 1])

It throws the following error:

config_yml : model/config_fvqa_gruc.yml
cpu_workers : 0
save_dirpath : exp_test_gruc
overfit : False
validate : True
gpu_ids : [0, 1]
dataset : fvqa
Loading FVQATrainDataset...
True
done splitting
Loading FVQATestDataset...
Loading glove...
Building Model...
Traceback (most recent call last):
  File "trainfvqa_gruc.py", line 512, in <module>
    train()
  File "trainfvqa_gruc.py", line 145, in train
    ddp_setup(0,1)
  File "trainfvqa_gruc.py", line 42, in ddp_setup
    init_process_group(backend="gloo", rank=0, world_size=1)
  File "/home/seecs/miniconda/envs/mucko-edit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 360, in init_process_group
    timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1544202130060/work/third_party/gloo/gloo/transport/tcp/device.cc:128] rp != nullptr. Unable to find address for: 127.0.0.1localhost.
localdomainlocalhost
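
The "Unable to find address for" message appears to come from gloo trying to resolve the machine's own hostname, so one quick check is whether that hostname resolves at all. The following is only a diagnostic sketch using the Python standard library; `hostname_resolves` is a hypothetical helper name, not part of PyTorch or gloo:

```python
import socket


def hostname_resolves(hostname: str) -> bool:
    """Return True if `hostname` can be resolved to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup failed; UnicodeError: the name is malformed
        # (for example an empty label) and cannot be encoded for the lookup.
        return False
    return True


# Report what this machine calls itself and whether that name resolves.
host = socket.gethostname()
print(f"hostname: {host!r}, resolves: {hostname_resolves(host)}")
```

If this reports that the hostname does not resolve, the hostname configuration of the machine (rather than the training script) would be the first thing to look at.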

To get more detail about the issue, I set os.environ["TORCH_DISTRIBUTED_DEBUG"]="DETAIL".
With that setting, it outputs:

Loading FVQATrainDataset...
True
done splitting
Loading FVQATestDataset...
Loading glove...
Building Model...
**Segmentation fault**

With the NCCL backend, training starts but gets stuck and doesn't go any further than this:

Training for epoch 0:
0%| | 0/2039 [00:00<?, ?it/s]

I found this suggested solution, but where should I add these lines?
Set GLOO_SOCKET_IFNAME, for example: export GLOO_SOCKET_IFNAME=eth0
This is mentioned in (Runtime error using Distributed with gloo - #2 by ifgovh).
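
As an alternative to `export` in the shell, the same variable can be set from inside the script, as long as that happens before `init_process_group` is called. A minimal sketch, assuming the interface name is passed in by the caller (`eth0` in the linked thread is only an example and may not exist on a given machine); `available_interfaces` and `set_gloo_interface` are hypothetical helper names:

```python
import os
import socket
from typing import List


def available_interfaces() -> List[str]:
    """List the network interface names known to the operating system."""
    return [name for _, name in socket.if_nameindex()]


def set_gloo_interface(ifname: str) -> None:
    """Point gloo at `ifname` via the GLOO_SOCKET_IFNAME environment variable.

    Has to run before init_process_group, which is when the variable is read.
    """
    if ifname not in available_interfaces():
        raise ValueError(
            f"unknown interface {ifname!r}; available: {available_interfaces()}"
        )
    os.environ["GLOO_SOCKET_IFNAME"] = ifname
```

In the code from the question, a call such as `set_gloo_interface("eth0")` would go at the top of `ddp_setup`, next to the `MASTER_ADDR` and `MASTER_PORT` lines. Printing `available_interfaces()` first shows which names are valid on the machine.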

Can someone help me with this issue?