Hi all,
I’m trying to use torch.distributed.launch with the NCCL backend on two nodes, each of which has a single GPU. The documentation here guides me to call torch.cuda.set_device(local_rank); however, each node has only device 0 available, so I’m not sure whether calling torch.cuda.set_device(0) in both processes is correct.
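For reference, this is roughly how I understand the local_rank plumbing on my side (parse_local_rank is just an illustrative helper, not the actual code in my script):

```python
import argparse

def parse_local_rank(argv):
    # torch.distributed.launch passes --local_rank to each spawned process
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args(argv)
    return args.local_rank

# On a single-GPU node, launch spawns one process with --local_rank 0, so
# both nodes would call torch.cuda.set_device(0); the two processes are
# still distinguished by their global rank (set via --node_rank / RANK).
```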
Either way, I get an error like this:
Traceback (most recent call last):
  File "batch_train.py", line 26, in <module>
    m.batch_train(argv[1:])
  File "/u3/jbaik/pytorch-asr/asr/models/deepspeech_ctc/train.py", line 56, in batch_train
    trainer = NonSplitTrainer(model, **vars(args))
  File "/u3/jbaik/pytorch-asr/asr/models/trainer.py", line 93, in __init__
    self.model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
  File "/home/jbaik/.pyenv/versions/3.7.0/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 134, in __init__
    self.broadcast_bucket_size)
  File "/home/jbaik/.pyenv/versions/3.7.0/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 251, in _dist_broadcast_coalesced
    dist.broadcast(flat_tensors, 0)
  File "/home/jbaik/.pyenv/versions/3.7.0/lib/python3.7/site-packages/torch/distributed/__init__.py", line 279, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /u3/setup/pytorch/pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error