The interaction between multiprocessing and distributed is confusing me a lot while I'm reading some code.
import torch
import torch.multiprocessing as mp

# the main entry point: spawn() calls this once per process and passes the rank as the first argument
def main_worker(rank, cfg):
    trainer = Train(rank, cfg)

if __name__ == '__main__':
    mp.spawn(main_worker, nprocs=cfg.gpus, args=(cfg,))
# here is a slice of the Train class
import datetime
import torch
import torch.distributed as dist

class Train():
    def __init__(self, rank, cfg):
        # nothing special
        self.rank = rank
        self.config = cfg
        if cfg.dist:
            dist.init_process_group(backend='gloo', rank=self.rank, world_size=self.config.gpus,
                                    init_method='file://./sharefile', timeout=datetime.timedelta(seconds=5))
            # the model itself is created earlier (omitted); here it is wrapped for distributed training
            self.model = torch.nn.parallel.DistributedDataParallel(self.model, device_ids=[self.rank])
Since I'm new to distributed coding, here are some things I can't understand.
1. From my perspective, spawn() starts the job by launching a group of processes, each running main_worker(rank, cfg), so every main_worker process maintains its own trainer object. But inside the trainer, the dist package also calls init_process_group, which seems to initialize yet another group of processes. What is the difference? Where do the real processes live? Are there two hierarchies of groups? A small sketch of my current understanding is below.
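For reference, this is a minimal, self-contained sketch of how I currently picture the relationship; the address, port, and the toy all_reduce are just placeholders I made up:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # spawn() has already created this OS process; init_process_group does not create
    # new processes, it only registers this one with the others so collectives can work
    dist.init_process_group(backend='gloo', rank=rank, world_size=world_size,
                            init_method='tcp://127.0.0.1:29500')
    t = torch.ones(1) * rank
    dist.all_reduce(t)          # sums the tensor across all ranks
    print(f'rank {rank}: {t.item()}')
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(worker, nprocs=world_size, args=(world_size,))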
2. dist.init_process_group reports runtime errors. When I choose nccl as the backend, it runs, waits, and then reports that there is no NCCL module. How do I install NCCL? When I switch the backend to 'gloo', it raises a RuntimeError with no message at all, so it cannot run either. How can I debug this? The small check I'm using right now is below.
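This is the single-process check I'm running to narrow it down (world_size=1 is just for testing on my laptop, and the port number is arbitrary):

import traceback
import torch.distributed as dist

print('nccl available:', dist.is_nccl_available())   # False on builds shipped without NCCL (e.g. Windows or CPU-only wheels)
print('gloo available:', dist.is_gloo_available())

try:
    dist.init_process_group(backend='gloo', rank=0, world_size=1,
                            init_method='tcp://127.0.0.1:29500')
    print('init ok')
    dist.destroy_process_group()
except RuntimeError:
    traceback.print_exc()   # hoping the full traceback says more than the empty message I get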
3. init_process_group takes an init_method argument that I can hardly understand. If I leave it as None, it fails with the message "No rendezvous handler for env://". I followed the tutorial and set the environment variables MASTER_ADDR and MASTER_PORT, but that seems to have no effect, so I ended up using 'file://./sharefile'. What does that actually mean? I'm totally confused. What I tried for env:// is shown below.
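For context, this is how I tried the env:// method, setting the variables inside each spawned worker the way the DDP tutorial does (the port number is arbitrary):

import os
import torch.distributed as dist

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # init_method=None defaults to 'env://', which reads the two variables above
    dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)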
4. Right now I'm running the code on my laptop, a single machine node with one GPU, but I will need to train the model on a single machine node with multiple GPU cards. Is there any code I have to change? My current guess is sketched below.
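My current guess for the single-node multi-GPU version is below (one process per GPU, each pinned to its own device, with nccl as the backend since that seems to be the recommended one for GPUs); the Linear layer is just a placeholder for my real model:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def main_worker(rank, world_size):
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size,
                            init_method='tcp://127.0.0.1:29500')
    torch.cuda.set_device(rank)                     # pin this process to its own GPU
    model = torch.nn.Linear(10, 10).cuda(rank)      # placeholder for my real model
    model = DDP(model, device_ids=[rank])
    # ... training loop (with a DistributedSampler on the DataLoader) would go here ...
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()          # one process per GPU card
    mp.spawn(main_worker, nprocs=world_size, args=(world_size,))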
I hope someone can solve my problem, or maybe point me to a comprehensive tutorial or example, not just the simple toy sample.
Thank you so much.