Confused about distributed and multiprocessing

Kevinkevin189 · December 3, 2020, 9:26am

The relationship between multiprocessing and distributed confuses me a lot when I read code like this:

import datetime
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# the main function to enter
def main_worker(rank, cfg):
    trainer = Train(rank, cfg)

if __name__ == '__main__':
    mp.spawn(main_worker, nprocs=cfg.gpus, args=(cfg,))

# here is a slice of the Train class
class Train():
    def __init__(self, rank, cfg):
        # nothing special
        if cfg.dist:
            dist.init_process_group(backend='gloo', rank=self.rank,
                                    world_size=self.config.gpus,
                                    init_method="file://./sharefile",
                                    timeout=datetime.timedelta(seconds=5))
            # init the model
            self.model = torch.nn.parallel.DistributedDataParallel(self.model, device_ids=[self.gpus])

I’m new to distributed code, so here are some things I can’t understand.

1. From my perspective, spawn starts a job by launching a group of processes with the arguments (rank, cfg), which means a group of main_worker functions, so every main_worker, as a process, maintains a trainer object. But inside the trainer object, the dist package also calls init_process_group, which seems to initialize another group of processes. What is the difference? Where do the real processes live? Are there two hierarchies of groups?
2. dist.init_process_group reports runtime errors. When I choose nccl as the backend, it runs and waits until it reports (no nccl module). How do I install an nccl module? When I changed it to ‘gloo’, it runs and raises a runtime error with no message, so it cannot run. How do I debug this?
3. init_process_group takes an argument init_method, which I can hardly understand. I left it as None, but it failed with the message (No rendezvous handler for env://). I followed the tutorial and set the environment variables MASTER_ADDR and MASTER_PORT, but it seemed to have no effect. So I had to change it to ‘file://./sharefile’. What is this? I’m totally confused.
4. Right now I’m running the code on my laptop, a single machine node with 1 GPU. But I need to train my model on a single machine node with multiple GPU cards. Is there any code I need to change?

I hope someone can solve my problem or point me to a comprehensive tutorial or example, not just a simple sample.
Thank you so much.

Kevinkevin189 · December 3, 2020, 10:17am

I also copied the code sample from the official PyTorch tutorials:

#!/usr/bin/env python
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    """ Distributed function to be implemented later. """
    pass

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    processes = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

The tutorial link is here: distributed_tutorial_link
But I still got an error.
This code only runs part of the way, and only when I set init_method = ‘file://./sharedfile’.
This is the error I got:

Process Process-1:
Traceback (most recent call last):
  File "C:\Anaconda3\envs\pt\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "C:\Anaconda3\envs\pt\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Kevin\Documents\Scripts\inpaint\test.py", line 13, in init_process
    dist.init_process_group('gloo',rank=rank,world_size=size,init_method='file://./sharedfile')
  File "C:\Anaconda3\envs\pt\lib\site-packages\torch\distributed\distributed_c10d.py", line 433, in init_process_group
    timeout=timeout)
  File "C:\Anaconda3\envs\pt\lib\site-packages\torch\distributed\distributed_c10d.py", line 508, in _new_process_group_helper
    timeout=timeout)
RuntimeError
Process Process-2:
Traceback (most recent call last):
  File "C:\Anaconda3\envs\pt\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "C:\Anaconda3\envs\pt\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Kevin\Documents\Scripts\inpaint\test.py", line 13, in init_process
    dist.init_process_group('gloo',rank=rank,world_size=size,init_method='file://./sharedfile')
  File "C:\Anaconda3\envs\pt\lib\site-packages\torch\distributed\distributed_c10d.py", line 433, in init_process_group
    timeout=timeout)
  File "C:\Anaconda3\envs\pt\lib\site-packages\torch\distributed\distributed_c10d.py", line 508, in _new_process_group_helper
    timeout=timeout)
RuntimeError

Process finished with exit code 0

Is there anything wrong with this code sample or my laptop?

pritamdamania87 · December 4, 2020, 11:13pm

torch.multiprocessing.spawn (mp.spawn) spawns the actual processes. init_process_group doesn’t create any new processes; it just initializes the distributed communication between the processes that were already spawned. For example, if you spawn 4 processes using mp.spawn and call init_process_group in each of those 4 processes, init_process_group ensures all 4 processes discover each other, and you can then run collective calls like allreduce among them.
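A minimal sketch of that split, assuming the gloo backend on CPU and a temporary rendezvous file. The names `worker` and `launch` and the file name are illustrative, not taken from the code in the question.

```python
import tempfile
from pathlib import Path

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size, init_method):
    """Body of one already-running process; it joins the group, it does not create processes."""
    dist.init_process_group(backend="gloo", init_method=init_method,
                            rank=rank, world_size=world_size)
    try:
        value = torch.tensor([float(rank)])
        # Collective call: afterwards every rank holds the sum of all ranks.
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        return value.item()
    finally:
        dist.destroy_process_group()


def launch(world_size=2):
    """mp.spawn is the only place where processes are created."""
    # Assumption: a fresh, not-yet-existing file in a temp directory as rendezvous point.
    rendezvous = Path(tempfile.mkdtemp()) / "rendezvous"
    # spawn calls worker(rank, world_size, init_method) with rank = 0..world_size-1.
    mp.spawn(worker, args=(world_size, rendezvous.as_uri()), nprocs=world_size, join=True)


# Call launch() only under the main guard, because spawned children re-import this file:
# if __name__ == "__main__":
#     launch(world_size=2)
```

mp.spawn discards the return value of `worker`; it is returned here only so the function can also be called directly with `world_size=1` for a quick check.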

You can find the NCCL installation guide here: Installation Guide :: NVIDIA Deep Learning NCCL Documentation

This section of our docs goes over details of init_method and how to use it: Distributed communication package - torch.distributed — PyTorch 2.1 documentation. In particular setting MASTER_ADDR and MASTER_PORT and leaving init_method=None should work.

You would need some changes so that you run one process per GPU on your single machine node. You can use mp.spawn to create one process per GPU.
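A rough sketch of the one-process-per-GPU layout, under several assumptions: a placeholder `torch.nn.Linear` model stands in for the real one, the rank doubles as the GPU index, and the code falls back to gloo on CPU when no GPU or NCCL is available. `train_worker` and `launch` are hypothetical names.

```python
import tempfile
from pathlib import Path

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train_worker(rank, world_size, init_method):
    """Runs in one process; on a single node the rank is used as the GPU index."""
    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda and dist.is_nccl_available() else "gloo"
    dist.init_process_group(backend=backend, init_method=init_method,
                            rank=rank, world_size=world_size)
    try:
        if use_cuda:
            torch.cuda.set_device(rank)
            device = torch.device("cuda", rank)
        else:
            device = torch.device("cpu")
        model = torch.nn.Linear(4, 2).to(device)  # placeholder model
        # device_ids names this process's single GPU; it must be None on CPU.
        ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
        loss = ddp_model(torch.randn(8, 4, device=device)).sum()
        loss.backward()  # DDP averages gradients across processes during backward
        return model.weight.grad is not None
    finally:
        dist.destroy_process_group()


def launch():
    """Start one process per visible GPU (or a single CPU process if there is none)."""
    world_size = torch.cuda.device_count() or 1
    rendezvous = Path(tempfile.mkdtemp()) / "rendezvous"
    mp.spawn(train_worker, args=(world_size, rendezvous.as_uri()), nprocs=world_size)


# if __name__ == "__main__":
#     launch()
```

With this layout the same script covers the 1-GPU laptop and the multi-GPU machine, since the number of processes follows `torch.cuda.device_count()`.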

pritamdamania87 · December 4, 2020, 11:18pm

Kevinkevin189:

from torch.multiprocessing import Process

def run(rank, size):
    """ Distributed function to be implemented later. """
    pass

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    processes = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

For this code, what is the error that you encountered?

Can you try something like file:///tmp/sharedfile instead? It looks like the file:// scheme is not handling ./ correctly.
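One way to see why the `./` form misbehaves, assuming the rendezvous code splits the URL the way Python's standard `urllib.parse` does: everything between `//` and the next `/` is treated as a host, so the `.` is dropped from the path and the file lands at the filesystem root. The helper `file_init_method` is a hypothetical convenience, not a PyTorch function.

```python
from pathlib import Path
from urllib.parse import urlparse


def describe(url):
    """Return the (host, path) pair that standard URL parsing extracts."""
    parsed = urlparse(url)
    return parsed.netloc, parsed.path


def file_init_method(path):
    """Build an absolute file:// URL for a rendezvous file (hypothetical helper)."""
    return Path(path).absolute().as_uri()


print(describe("file://./sharedfile"))     # ('.', '/sharedfile')
print(describe("file:///tmp/sharedfile"))  # ('', '/tmp/sharedfile')
print(file_init_method("sharedfile"))      # an absolute file:/// URL ending in /sharedfile
```

Passing an absolute URL such as the one from `file_init_method` avoids depending on how a relative path is interpreted.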

Kevinkevin189 · December 6, 2020, 11:00am

Thanks for your replies. I now understand processes and distributed a little better.

I found the real bug when I ran this tutorial code on a Linux machine. It seems Windows only supports the shared-file method for inter-process communication. Because of a file permission issue, it reported a RuntimeError with no error message on Windows. On Linux it reported the same RuntimeError, but with a ‘Permission denied’ message.
I hope the official team will fix this ‘bug’ (an error with no explicit message is really annoying).

pritamdamania87 · December 8, 2020, 2:05am

I got an error as follows:

RuntimeError: open(/sharedfile): Permission denied

Doesn’t this error message clearly indicate the problem?

Kevinkevin189 · December 8, 2020, 12:14pm

Thanks!
I got a blank message on my laptop (Windows 10 Professional). I only got that message after I moved my code to a Linux machine.