Hello. I have looked through some tutorials about DistributedDataParallel. Some of them use torch.multiprocessing.spawn, but others say spawn should not be used (for example, this page: "1. It must provide an entry-point function for a single worker. For example, it should not launch subprocesses using torch.multiprocessing.spawn.").
This makes me very confused.
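To make sure I am comparing the right things, this is roughly how I understand the two patterns (just my own sketch; ToyModel and demo_fn are placeholder names, and I may have details wrong):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(torch.nn.Module):
    # placeholder model, just so the snippet is self-contained
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(10, 10)

    def forward(self, x):
        return self.net(x)


# Pattern A: the script launches its own workers with mp.spawn,
# so each worker receives an explicit rank and world_size.
def demo_fn(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    ddp_model = DDP(ToyModel())
    # ... training loop ...
    dist.destroy_process_group()


def launch_with_spawn(world_size=2):
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)


# Pattern B: torchrun starts the workers, so the script is just the
# per-worker entry point, and rank/world_size come from the environment
# variables that torchrun sets.
def main_for_torchrun():
    dist.init_process_group("gloo")
    ddp_model = DDP(ToyModel())
    # ... training loop ...
    dist.destroy_process_group()


if __name__ == "__main__":
    main_for_torchrun()  # for Pattern A, call launch_with_spawn() instead
```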
Besides, I have some other questions.
- If I have a single GPU, do I have to use only ONE process when I call torchrun? Or can I use more than one process?
- In which cases will I need --rdzv_id, --rdzv_backend, and --rdzv_endpoint? Do I have to use them for multiple nodes with multiple GPUs, for a single node with multiple GPUs, or for a single node with a multi-core CPU?
- If I use DistributedDataParallel on a Slurm system where each node has a CPU with many cores (for example, 32), how should I use torchrun?
In that case, I think my script should be:
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")  # do not pass rank=rank, world_size=world_size
model = ToyModel()               # do not call .to(rank)
ddp_model = DDP(model)           # do not pass device_ids=[rank]
Is this right?
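And on the launching side, is this the kind of difference I should expect? (The host name, port, and id below are made up, and train.py stands in for my script.)

```
# single node (CPU or GPUs on one machine): rendezvous flags not needed
torchrun --standalone --nproc_per_node=4 train.py

# multiple nodes: run on every node, pointing at one shared rendezvous endpoint
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_id=456 --rdzv_backend=c10d \
    --rdzv_endpoint=node0.example.com:29400 \
    train.py
```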