Hello. I have looked through some tutorials about DistributedDataParallel. Some of them use torch.multiprocessing.spawn, but others say spawn should not be used (for example, this page: "1. It must provide an entry-point function for a single worker. For example, it should not launch subprocesses using torch.multiprocessing.spawn.").
This makes me very confused.
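To make sure I am comparing the right things, this is roughly how I understand the two patterns (just my own sketch; ToyModel and demo_fn are placeholder names, and I may have details wrong):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(torch.nn.Module):
    # placeholder model, just so the snippet is self-contained
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(10, 10)

    def forward(self, x):
        return self.net(x)


# Pattern A: the script launches its own workers with mp.spawn,
# so each worker receives an explicit rank and world_size.
def demo_fn(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    ddp_model = DDP(ToyModel())
    # ... training loop ...
    dist.destroy_process_group()


def launch_with_spawn(world_size=2):
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)


# Pattern B: torchrun starts the workers, so the script is just the
# per-worker entry point, and rank/world_size come from the environment
# variables that torchrun sets.
def main_for_torchrun():
    dist.init_process_group("gloo")
    ddp_model = DDP(ToyModel())
    # ... training loop ...
    dist.destroy_process_group()


if __name__ == "__main__":
    main_for_torchrun()  # for Pattern A, call launch_with_spawn() instead
```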
Besides, I have some other questions.
- If I have a single GPU, do I have to use only ONE process when I call torchrun? Or can I use more than one process?
- In which cases will I need --rdzv_id, --rdzv_backend, and --rdzv_endpoint? Do I have to use them for multiple nodes with multiple GPUs, for a single node with multiple GPUs, or for a single node with a multi-core CPU?
- If I use DistributedDataParallel on a Slurm system where each node has a CPU with many cores (for example, 32), how should I use torchrun?
In that case, I think my script should be:
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")  # do not pass rank=rank, world_size=world_size
model = ToyModel()               # do not call .to(rank)
ddp_model = DDP(model)           # do not pass device_ids=[rank]
Is this right?
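And on the launching side, is this the kind of difference I should expect? (The host name, port, and id below are made up, and train.py stands in for my script.)

```
# single node (CPU or GPUs on one machine): rendezvous flags not needed
torchrun --standalone --nproc_per_node=4 train.py

# multiple nodes: run on every node, pointing at one shared rendezvous endpoint
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_id=456 --rdzv_backend=c10d \
    --rdzv_endpoint=node0.example.com:29400 \
    train.py
```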