I am trying to train my model on multiple GPUs. I imported the following libraries and added this code for it:
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
Initialization:
def ddp_setup(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    init_process_group(backend="gloo", rank=0, world_size=1)
My model:
model = CMGCNnet(config,
                 que_vocabulary=glovevocabulary,
                 glove=glove,
                 device=device)
model = model.to(0)
if -1 not in args.gpu_ids and len(args.gpu_ids) > 1:
    model = DDP(model, device_ids=[0, 1])
It throws the following error:
config_yml : model/config_fvqa_gruc.yml
cpu_workers : 0
save_dirpath : exp_test_gruc
overfit : False
validate : True
gpu_ids : [0, 1]
dataset : fvqa
Loading FVQATrainDataset…
True
done splitting
Loading FVQATestDataset…
Loading glove…
Building Model…
Traceback (most recent call last):
  File "trainfvqa_gruc.py", line 512, in <module>
    train()
  File "trainfvqa_gruc.py", line 145, in train
    ddp_setup(0,1)
  File "trainfvqa_gruc.py", line 42, in ddp_setup
    init_process_group(backend="gloo", rank=0, world_size=1)
  File "/home/seecs/miniconda/envs/mucko-edit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 360, in init_process_group
    timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1544202130060/work/third_party/gloo/gloo/transport/tcp/device.cc:128] rp != nullptr. Unable to find address for: 127.0.0.1localhost.
localdomainlocalhost
I tried to get more detail on the issue by setting os.environ["TORCH_DISTRIBUTED_DEBUG"]="DETAIL".
It outputs:
Loading FVQATrainDataset...
True
done splitting
Loading FVQATestDataset...
Loading glove...
Building Model...
**Segmentation fault**
With the NCCL backend, training starts but gets stuck and doesn't go any further than this:
Training for epoch 0:
0%| | 0/2039 [00:00<?, ?it/s]
I found this solution, but where should I add these lines?
GLOO_SOCKET_IFNAME, for example:
export GLOO_SOCKET_IFNAME=eth0
as mentioned in "Runtime error using Distributed with gloo" (#2 by ifgovh).
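My understanding is that the export line belongs in the shell before the script is launched, or that the same variable can be set from Python before init_process_group is called. Below is a rough sketch of the second option, not something I have confirmed fixes the error. The helper name configure_gloo_env is made up for illustration, and eth0 is only the interface name from the linked post; the actual name on my machine may differ.

```python
import os


def configure_gloo_env(ifname: str = "eth0",
                       addr: str = "localhost",
                       port: str = "12355") -> dict:
    """Set the environment variables read during process-group setup.

    Hypothetical helper: the name and defaults are assumptions.
    `ifname` must match a real network interface on the machine
    (eth0 is just the example from the linked post).
    """
    settings = {
        "MASTER_ADDR": addr,
        "MASTER_PORT": port,
        "GLOO_SOCKET_IFNAME": ifname,
    }
    # Equivalent to running `export NAME=value` in the shell
    # before starting the training script.
    os.environ.update(settings)
    return settings


# This would have to run before the process group is created, e.g.:
#   configure_gloo_env("eth0")
#   init_process_group(backend="gloo", rank=rank, world_size=world_size)
```

If this is the right place, I assume the call would go at the top of ddp_setup, replacing the two os.environ lines for MASTER_ADDR and MASTER_PORT.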
Can someone help me with this issue?