Hello, I am trying to use the DistributedDataParallel module to parallelize a model over multiple CPUs or a single GPU. I have read some tutorials on pytorch.org and also some code written by others (e.g. REANN), but I am still confused about how to use the DistributedDataParallel module.
For example, in the tutorial I see the following code:
import torch.multiprocessing as mp

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)
But I could not find this part in the REANN code (it also uses the DistributedDataParallel module, but it does not spawn processes with the multiprocessing module).
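From what I can tell so far (and I may be wrong), mp.spawn is only one way of starting the worker processes; an external launcher such as torchrun can start them instead, which would explain why REANN does not need it. This is my rough understanding of the two options (demo_fn here is just a placeholder of mine):

# Option A (tutorial style): the script spawns its own workers.
# demo_fn receives the process rank as its first argument automatically.
import torch.multiprocessing as mp

def demo_fn(rank, world_size):
    print(f"worker {rank} of {world_size}")

if __name__ == "__main__":
    world_size = 4
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)

# Option B (what REANN seems to rely on): an external launcher starts one
# process per worker, e.g.
#   torchrun --nproc_per_node=4 train.py
# so the training script itself contains no mp.spawn call at all.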
Another example that confuses me is this:
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
dist.init_process_group("gloo", rank=rank, world_size=world_size)
In which cases should the parameters rank and world_size be passed explicitly? In the REANN code, I only see this:
dist.init_process_group(backend=DDP_backend)
That is, it only sets the backend depending on whether the CPU or GPU is used, and never passes rank or world_size.
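If my reading is right, the difference is only where rank and world_size come from. This is how I currently understand the two setups (the use_gpu flag is my own placeholder, not something from REANN); please correct me if this is wrong:

import os
import torch.distributed as dist

# Pattern 1 (tutorial): the script sets the rendezvous info itself and spawns
# its own workers, so rank and world_size must be passed in explicitly.
def setup_explicit(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

# Pattern 2 (what REANN seems to do): an external launcher (e.g. torchrun) has
# already exported RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so the
# default env:// initialization only needs the backend.
def setup_from_env(use_gpu):
    backend = "nccl" if use_gpu else "gloo"
    dist.init_process_group(backend=backend)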
There are also other things that confuse me, for example: 1) do I have to split the dataset manually and send a piece to every process, and 2) when should I use the to() method on a tensor or model to assign a device?
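To show what I mean, this is roughly the pattern I have pieced together for the data and device part (using DistributedSampler so that I do not have to split the dataset by hand; the dataset, model, and the single-node assumption of one GPU per rank are just my own example, not from the tutorial or REANN):

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP

def train(use_gpu):
    # assumes dist.init_process_group(...) has already been called
    rank = dist.get_rank()

    # DistributedSampler gives each process its own shard of the dataset,
    # so no manual splitting/sending is needed.
    dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # single-node assumption: process with rank r uses GPU r
    device = torch.device(f"cuda:{rank}") if use_gpu else torch.device("cpu")
    model = torch.nn.Linear(10, 1).to(device)            # move the model once
    ddp_model = DDP(model, device_ids=[rank] if use_gpu else None)

    loss_fn = torch.nn.MSELoss()
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)                          # reshuffle each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)             # move each batch
            opt.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()
            opt.step()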
So, is there a standard example of using the DistributedDataParallel module on CPUs and on GPUs? Thanks.