A few questions and comments about the distributed API
- Although CUDA tensors are pretty much not supported for the tcp, gloo, or mpi backends, after receiving the torch tensors, can we still convert them with `.cuda()` and use them locally? (There is a sketch of what I mean after this list.)
- Typo in the `torch.distributed.get_rank()` API docs: "Rank is a unique identifier assigned to each process withing a distributed group" ("withing" should be "within").
- Do all machines have to be on the same network? I didn't catch this from reading the API or the GitHub release note, but it seems like I can't train across networks, e.g. one machine in Massachusetts and another in New York.
- 3.5. Can members of a group be on different networks?
- Just to see if my understanding is correct: you start a distributed training session with `torch.distributed.init_process_group(backend)`, where the backend parameter specifies the network protocol. Then you wrap the network with `torch.nn.parallel.DistributedDataParallel(model)`, set up the dataset and wrap it in `torch.utils.data.distributed.DistributedSampler(train_dataset)`, and set up everything else as usual and train normally? (There is a sketch of this recipe below as well.)
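
For the first question, here is a minimal sketch of what I mean, assuming the gloo backend, two processes launched with MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE set in the environment, and a GPU available on the receiving machine:

```python
# Minimal sketch (my assumption, not from the docs): recv into a CPU
# tensor over a CPU backend, then move it to the GPU for local use.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")

if dist.get_rank() == 0:
    # Send a CPU tensor; the CPU backends expect CPU tensors here.
    dist.send(torch.randn(4), dst=1)
else:
    buf = torch.zeros(4)   # receive into a CPU tensor...
    dist.recv(buf, src=0)
    buf = buf.cuda()       # ...then convert it for local GPU use?
```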
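
And here is a short sketch of the recipe from the last question, assuming the gloo backend and the same environment variables; the `Linear` model and random `TensorDataset` are just stand-ins to keep it self-contained:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# 1. Start the session; the backend argument picks the protocol.
dist.init_process_group(backend="gloo")

# Toy model and dataset standing in for the real ones.
model = torch.nn.Linear(10, 2)
train_dataset = TensorDataset(torch.randn(256, 10),
                              torch.randint(0, 2, (256,)))

# 2. Wrap the model so gradients are averaged across processes.
model = DistributedDataParallel(model)

# 3. Shard the dataset so each process sees a distinct subset.
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

# 4. Everything else as usual: optimizer, loss, normal training loop.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
```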
Thanks.