Request for an example of using DistributedDataParallel on multiple CPUs or a single GPU

Hello, I am trying to use the DistributedDataParallel module to parallelize a model on multiple CPUs or a single GPU. I have read some tutorials on pytorch.org as well as code written by others (e.g. REANN), but I am still confused about how to use DistributedDataParallel.

For example, in the tutorial I see the following code:

import torch.multiprocessing as mp

def run_demo(demo_fn, world_size):
    # spawn world_size processes; each one runs demo_fn(rank, world_size)
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)
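
(For context, my rough understanding is that demo_fn is a per-process function: it receives its rank from mp.spawn, initializes the process group, and wraps the model in DDP. Below is a minimal sketch of what I imagine; the linear model is just a placeholder of mine and not from the tutorial.)

import os
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_fn(rank, world_size):
    # mp.spawn passes the process index as the first argument (the rank)
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 1)   # placeholder model
    ddp_model = DDP(model)     # no device_ids when running on CPU with gloo

    # ... training loop with ddp_model would go here ...

    dist.destroy_process_group()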

However, I cannot find this mp.spawn part in the REANN code (it also uses the DistributedDataParallel module, but it does not use the multiprocessing module).
Another example that confuses me is:

os.environ['MASTER_ADDR'] = 'localhost'   # address of the rank-0 process
os.environ['MASTER_PORT'] = '12355'       # a free port on that machine
dist.init_process_group("gloo", rank=rank, world_size=world_size)

In which cases should the parameters rank and world_size be passed explicitly? In the REANN code, I only see this:

dist.init_process_group(backend=DDP_backend)

That is, it seems I only need to set the backend, regardless of whether the CPU or a GPU is used.
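
From what I can tell (please correct me if I am wrong), REANN seems to rely on an external launcher such as torchrun or python -m torch.distributed.launch, which starts one process per rank and exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE as environment variables, so init_process_group can read them without explicit arguments. My guess of how that looks (the script name train.py is only an example of mine):

# launched as, e.g.:  torchrun --nproc_per_node=4 train.py
# torchrun exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for each process
import torch.distributed as dist

dist.init_process_group(backend="gloo")   # rank and world_size are read from the environment

rank = dist.get_rank()
world_size = dist.get_world_size()
print(f"rank {rank} of {world_size} initialized")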
There are other aspects that confuse me as well, for example: 1) do I have to split the dataset manually and send a part to every process, and 2) when should I call the to() method on a tensor or model to assign it to a device? I sketch my current guess below.
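
Here is a minimal sketch of how I currently think the data splitting and device placement are supposed to work: a DistributedSampler gives each rank its own shard of the dataset, and to() is called on the model before wrapping it in DDP and on every batch. The dataset and the linear model are only placeholders of mine, and the script assumes it is launched with torchrun (or with the environment variables set some other way).

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

# placeholder dataset; the sampler shards it across ranks automatically
dataset = TensorDataset(torch.randn(128, 10), torch.randn(128, 1))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

model = torch.nn.Linear(10, 1).to(device)   # move the model before wrapping it
ddp_model = DDP(model,
                device_ids=[local_rank] if torch.cuda.is_available() else None)
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

for epoch in range(2):
    sampler.set_epoch(epoch)                # reshuffle the shards every epoch
    for x, y in loader:
        x, y = x.to(device), y.to(device)   # move each batch to the same device
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()
        optimizer.step()

dist.destroy_process_group()

Is this roughly the intended pattern, or am I missing something?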

So, is there a standard example of using the DistributedDataParallel module on CPUs and on a GPU? Thanks.