PyTorch v0.2 Distributed Questions

A few questions and comments about the distributed API:

  1. CUDA tensors are mostly unsupported by the tcp, gloo, and mpi backends. After receiving torch tensors on the CPU, can we still move them to the GPU with .cuda() and use them locally?
  2. Typo in the torch.distributed.get_rank() API docs: “Rank is a unique identifier assigned to each process withing a distributed group” (“withing” should be “within”).
  3. Do all machines have to be on the same network? I didn’t catch this from reading the API docs or the GitHub release notes, but it seems like I can’t train across networks, e.g. with one machine in Massachusetts and another in New York.
    3.5. Can members of a group be on different networks?
  4. Just to check that my understanding is correct:
    You start a distributed training session with torch.distributed.init_process_group(backend), where the backend parameter specifies the network protocol. Then you wrap the model with torch.nn.parallel.DistributedDataParallel(model). Set up the dataset and wrap it in, then set up everything else as usual and train normally?
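
To make question 4 concrete, here is a rough sketch of the workflow as I understand it. It is written against a recent PyTorch release rather than v0.2, so the keyword arguments (rank, world_size, init_method) may not match v0.2 exactly. I am also assuming the dataset is meant to be paired with torch.utils.data.distributed.DistributedSampler. The names make_loader and run_worker, the toy data, and the tcp://127.0.0.1:29500 address are placeholders made up for illustration.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(rank, world_size, num_samples=64, batch_size=8):
    """Hypothetical helper: build a toy dataset sharded across ranks."""
    # Toy regression data; a real script would load its own dataset here.
    features = torch.randn(num_samples, 10)
    targets = torch.randn(num_samples, 1)
    dataset = TensorDataset(features, targets)
    # Assumption: the sampler is what gives each rank its own shard of indices.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return loader, sampler


def run_worker(rank, world_size, backend="gloo",
               init_method="tcp://127.0.0.1:29500", epochs=2):
    """Hypothetical per-process entry point; every process calls it with its own rank."""
    # The address above is a placeholder; all processes must be able to reach it.
    dist.init_process_group(backend=backend, init_method=init_method,
                            rank=rank, world_size=world_size)
    try:
        # CPU model wrapped for distributed training.
        model = DistributedDataParallel(nn.Linear(10, 1))
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()
        loader, sampler = make_loader(rank, world_size)
        for epoch in range(epochs):
            sampler.set_epoch(epoch)  # vary the shuffle order between epochs
            for features, targets in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(features), targets)
                loss.backward()  # DDP synchronizes gradients across ranks here
                optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()
```

Each process would call run_worker with its own rank (for example run_worker(0, 2) on one machine and run_worker(1, 2) on another, both pointing at the same init_method address). Is this roughly the intended usage?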