What does world_size mean, and what does set_device do?


I am learning about torch.nn.parallel.DistributedDataParallel. Compared with the simpler nn.DataParallel, it seems to introduce many new concepts. After reading some training scripts, I am a bit confused.

At the beginning of these scripts there is always a call like torch.cuda.set_device(arg.local_rank). What does this do? Does it play the same role as the environment variable CUDA_VISIBLE_DEVICES=1,2,3,4?
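The two are related but not the same. CUDA_VISIBLE_DEVICES is read by the CUDA runtime and decides which physical GPUs a process can see at all, renumbering them from 0. torch.cuda.set_device(i) then picks the default device among those visible GPUs, so that a bare `.cuda()` or `torch.device("cuda")` lands on GPU `i`. The helper below is hypothetical (not a PyTorch API) and only models that mapping, assuming CUDA_VISIBLE_DEVICES holds integer IDs rather than GPU UUIDs:

```python
import os


def physical_gpu_for(local_rank, visible=None):
    """Hypothetical helper: which physical GPU does set_device(local_rank) end up on?

    CUDA_VISIBLE_DEVICES masks and renumbers GPUs; set_device then selects an
    index *within* that renumbered list. Assumes integer IDs, not UUIDs.
    """
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No mask: logical index 0..N-1 matches the physical index.
        return local_rank
    ids = [int(x) for x in visible.split(",") if x.strip()]
    # Raises IndexError if local_rank >= number of visible GPUs.
    return ids[local_rank]
```

With CUDA_VISIBLE_DEVICES=1,2,3,4, the process launched with local_rank 0 calls set_device(0) and uses physical GPU 1, local_rank 3 uses physical GPU 4, and physical GPU 0 is invisible to every process.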

There is also a line like torch.distributed.init_process_group(world_size=4). What does world_size mean here? Is it the number of new processes created to train the model?
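world_size is the total number of processes taking part in the job, across all machines, and each process has a unique rank from 0 to world_size - 1. init_process_group does not create any processes itself: it blocks until world_size processes (started by a launcher such as torchrun or torch.multiprocessing.spawn) have joined. With one process per GPU, world_size equals the total GPU count. The function below is an illustrative sketch, not launcher code, of how ranks are typically laid out when every node runs the same number of processes:

```python
def launch_layout(nnodes, nproc_per_node):
    """Illustrative only: list (node_rank, local_rank, global_rank) triples.

    Assumes homogeneous nodes and one process per GPU, the usual layout
    produced by launchers such as torchrun.
    """
    world_size = nnodes * nproc_per_node
    layout = []
    for node_rank in range(nnodes):
        for local_rank in range(nproc_per_node):
            # local_rank indexes GPUs on this machine (what set_device uses);
            # the global rank is unique across the whole job.
            rank = node_rank * nproc_per_node + local_rank
            layout.append((node_rank, local_rank, rank))
    return world_size, layout
```

On a single machine with 4 GPUs, launch_layout(1, 4) gives world_size 4 and global ranks equal to local ranks (0 to 3); on two such machines world_size becomes 8, and local_rank 1 on the second node is global rank 5.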

Lastly, could you provide a brief example of how to train a model with 4 GPUs on a single machine? I used to use nn.DataParallel; how would I do the same with DistributedDataParallel? Must I also pass my dataset through a torch.utils.data.distributed.DistributedSampler?
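A minimal single-machine sketch, assuming one process per GPU started with torch.multiprocessing.spawn. The toy TensorDataset, the nn.Linear model, and the MASTER_ADDR/MASTER_PORT values are placeholders, not part of the original question. It falls back to the gloo backend on CPU when no GPU is present. It also answers the sampler question: without DistributedSampler every process would iterate over the full dataset, so it is normally used to give each rank a distinct shard.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def run_worker(rank, world_size, epochs=2):
    # Rendezvous address for init_process_group (placeholder values).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # On one machine the global rank doubles as the local GPU index.
        torch.cuda.set_device(rank)
        device = torch.device("cuda", rank)
    else:
        device = torch.device("cpu")

    # Blocks until all world_size processes have joined.
    dist.init_process_group("nccl" if use_cuda else "gloo",
                            rank=rank, world_size=world_size)

    # Toy data; replace with your own Dataset.
    torch.manual_seed(0)
    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    # Each rank gets a disjoint ~1/world_size share of the indices.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    model = nn.Linear(10, 1).to(device)
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # different shuffle order each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    world_size = torch.cuda.device_count() or 1  # 4 on a 4-GPU machine
    # spawn passes the process index (0..nprocs-1) as the first argument.
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
```

Unlike nn.DataParallel, where one process scatters each batch to all GPUs, here each process owns one GPU and one model replica, and only gradients are synchronized. If you launch with `torchrun --nproc_per_node=4 train.py` instead of mp.spawn, each process reads its rank from the environment (for example the LOCAL_RANK variable) rather than receiving it as an argument.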

Hi @coincheung

Do you still need help on this one?

I need help with this too. Could you please explain?

No, thanks. I tried to figure it out myself.

Hi @enjlee and @coincheung
Sorry for the late reply. Some time ago I wrote a blog post that may shed some light on this. It explains how to set up not only data parallelism on a single machine, but also how to split a large model across multiple GPUs and run multiple replicas on different nodes, training them in a distributed fashion. In that setup, each process drives more than one GPU.
It might be overkill, but I think it explains the underlying mechanism of PyTorch distributed training better.
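When a model is split across several GPUs, each process owns a block of GPUs instead of one, so world_size becomes the number of replicas rather than the number of GPUs. The hypothetical helper below (not from the blog post or PyTorch) sketches one common contiguous assignment; in that case the DDP wrapper is built with device_ids=None, because the module already spans several devices.

```python
def assign_gpus(total_gpus, gpus_per_replica):
    """Hypothetical helper: map each rank to the GPUs its model replica spans.

    Assumes contiguous blocks and that total_gpus is a multiple of
    gpus_per_replica; world_size is the number of replicas (processes).
    """
    if total_gpus % gpus_per_replica:
        raise ValueError("total_gpus must be a multiple of gpus_per_replica")
    world_size = total_gpus // gpus_per_replica
    return {
        rank: list(range(rank * gpus_per_replica, (rank + 1) * gpus_per_replica))
        for rank in range(world_size)
    }
```

For example, assign_gpus(4, 2) yields two processes: rank 0 places its model halves on GPUs 0 and 1, and rank 1 on GPUs 2 and 3.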

Please let me know if you have any questions after reading the blog post.
