I am learning about torch.nn.parallel.DistributedDataParallel. Compared with the plain nn.DataParallel, it seems to introduce many new concepts, and after reading some example programs I am a bit confused.
At the beginning of these programs there is always a call like torch.cuda.set_device(args.local_rank). What does this do? Does it play the same role as the CUDA_VISIBLE_DEVICES environment variable?
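For reference, the pattern I keep seeing at the top of these scripts looks roughly like this (a minimal sketch; the argument name and its default are my guesses at what the launcher passes in):

```python
import argparse

import torch

# the launcher (e.g. `python -m torch.distributed.launch`) seems to pass
# a distinct --local_rank to each process it spawns
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# bind this process to one GPU; subsequent .cuda() calls default to it
torch.cuda.set_device(args.local_rank)
```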
Besides, there is also a line like torch.distributed.init_process_group(world_size=4). What does world_size mean here? Is it the number of new processes created to train the model?
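For context, the full call in those programs tends to look like the sketch below; reading the rank from a RANK environment variable is my assumption about how the launcher wires things up:

```python
import os

import torch.distributed as dist

rank = int(os.environ.get("RANK", 0))  # assuming the launcher sets RANK

# every one of the 4 processes runs this same line; my reading is that
# world_size is the total number of processes in the job, and rank is
# this process's index in 0..world_size-1
dist.init_process_group(backend="nccl", world_size=4, rank=rank)
```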
Lastly, could you provide a brief example of how to train a model on 4 GPUs on a single machine with DistributedDataParallel? I used to employ nn.DataParallel for this. Must I also wrap my dataset with a DistributedSampler?
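For comparison, this is roughly how I do it today with nn.DataParallel (Net is just a stand-in for my real model):

```python
import torch
import torch.nn as nn

class Net(nn.Module):  # stand-in for my real model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

# a single process drives all 4 GPUs: each batch is split along dim 0,
# the replicas run their chunks, and outputs are gathered on GPU 0
model = nn.DataParallel(Net().cuda(), device_ids=[0, 1, 2, 3])
out = model(torch.randn(8, 10).cuda())
```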