When should we use torch.multiprocessing.spawn?

Hello. I looked through some tutorials about DistributedDataParallel. Some of them use torch.multiprocessing.spawn, while others say spawn should not be used (for example, this page: "1. It must provide an entry-point function for a single worker. For example, it should not launch subprocesses using torch.multiprocessing.spawn").
This makes me very confused.

Besides, I have some other questions.

  1. If I have a single GPU, do I have to use ONE process when I call torchrun? Can I use more than one process?
  2. In which cases will I need --rdzv_id, --rdzv_backend, and --rdzv_endpoint? Do I need them for multiple nodes with multiple GPUs, a single node with multiple GPUs, or a single node with a multi-core CPU?
  3. If I use DistributedDataParallel on a SLURM system where each node has a CPU with many cores (for example, 32), how should I use torchrun?
    In this case, my script should be:
dist.init_process_group("gloo")  # do not use rank=rank, world_size=world_size
model = ToyModel()               # do not use .to(rank)
ddp_model = DDP(model)           # do not use device_ids=[rank]

Is this right?
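
For reference, here is a fuller sketch of what I have in mind (ToyModel is just a placeholder linear model, and I'm assuming torchrun sets the RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables that init_process_group reads):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    """Placeholder model, only for illustration."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 5)

    def forward(self, x):
        return self.net(x)


def main():
    # torchrun exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so init_process_group needs no explicit rank/world_size here
    dist.init_process_group("gloo")
    model = ToyModel()        # stays on the CPU, no .to(rank)
    ddp_model = DDP(model)    # no device_ids for a CPU model
    out = ddp_model(torch.randn(4, 10))
    out.sum().backward()      # gradients are all-reduced across processes
    dist.destroy_process_group()


if __name__ == "__main__":
    # launch with e.g.: torchrun --nproc_per_node=4 this_script.py
    main()
```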

  1. Since DDP uses a single process per device, you would use one process for your single GPU, or you could skip torchrun entirely and execute your single-GPU script directly.
  2. These rdzv settings are used for multi-node runs and can also be used to launch multiple runs on a single node, as seen here; see the example commands after this list.
  3. I don’t know how pure CPU DDP works.
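
For illustration, the launch commands could look roughly like this (train.py, the node count, and the endpoint address are placeholders):

```bash
# single node, single GPU: one process, or just run the script directly
torchrun --standalone --nproc_per_node=1 train.py
python train.py

# multi-node run: execute on every node, all pointing at the same rendezvous endpoint
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_id=job_42 --rdzv_backend=c10d \
    --rdzv_endpoint=node0.example.com:29400 train.py
```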