I am new to PyTorch and cannot understand how `torch.distributed.init_process_group` works. The documentation is not clear to me, and I have not found anything helpful on Google.
Any help would be appreciated.
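For context, here is a minimal sketch of the call I am asking about. This is an assumption on my part, not code from any tutorial: it uses the `gloo` backend with a single process (`world_size=1`) so it can run on one machine without GPUs.

```python
import os
import torch.distributed as dist

# init_process_group joins this process to a group of world_size workers.
# rank identifies this worker within the group (0 .. world_size - 1).
# MASTER_ADDR / MASTER_PORT tell the workers where to rendezvous.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

rank, world = dist.get_rank(), dist.get_world_size()
print(f"rank {rank} of {world}")

dist.destroy_process_group()
```

With `world_size=1` this trivially succeeds; in a real job every worker process makes the same call with its own `rank`, and the call blocks until all of them have joined.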
Hi, check out this tutorial:
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Distributed
This link opens a project where a file, multiproc.py, launches main.py through subprocess.Popen(). In main.py, init_process_group() is called and the model is wrapped in torch.nn.parallel.DistributedDataParallel(). Can you explain the pipeline here? How are the separate copies of the model on the different GPUs kept in sync, and how do these two scripts implement data parallelism?
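To make my question concrete, here is my rough sketch of what I think the two scripts do, collapsed into one file. This is my own guess, not code from the linked project: I use torch.multiprocessing.spawn in place of subprocess.Popen and the CPU `gloo` backend in place of `nccl`/GPUs, so it runs anywhere.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Plays the role of main.py: one process per model replica.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every rank builds the same model; DDP broadcasts rank 0's weights
    # at construction so all replicas start identical.
    model = DDP(torch.nn.Linear(4, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Data parallelism: each rank trains on a different shard of the data
    # (simulated here by a per-rank random batch).
    torch.manual_seed(rank)
    x, y = torch.randn(8, 4), torch.randn(8, 2)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()  # DDP all-reduces (averages) gradients across ranks
    opt.step()       # every rank applies the same averaged gradient

    # Verify the replicas are still identical after the update.
    p = model.module.weight.detach().clone()
    gathered = [torch.zeros_like(p) for _ in range(world_size)]
    dist.all_gather(gathered, p)
    assert all(torch.allclose(g, gathered[0]) for g in gathered)

    dist.destroy_process_group()

if __name__ == "__main__":
    # Plays the role of multiproc.py: launch one worker per replica.
    mp.spawn(worker, args=(2,), nprocs=2)
</imports>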