I’m building a Trainer class that supports PyTorch DDP (multi-GPU training). The structure looks like this:
class Trainer:
    def __init__(self, ...):
        # initialize everything once here, in the parent process
        self.train_loader = ...
        self.model = ...
        self.optimizer = ...

    def train_loop(self, rank, world_size):
        # move the model and each batch to this process's GPU
        self.model.to(rank)
        for data in self.train_loader:
            data = data.to(rank)

    def run(self):
        # spawn one training process per GPU
        world_size = ...
        torch.multiprocessing.spawn(self.train_loop,
                                    args=(world_size,),
                                    nprocs=world_size, join=True)
The above code runs, but the absence of errors doesn’t mean it’s doing what it’s supposed to do, especially with DDP.
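For comparison, below is my understanding of the standard per-process DDP setup (join a process group, wrap the model in DistributedDataParallel, shard the data with a DistributedSampler) that I am trying to fold into the class. The NCCL backend, the MASTER_ADDR/MASTER_PORT values, and the toy model/dataset are placeholders I chose for this sketch, not part of my actual Trainer:

```python
# Sketch of the per-process DDP setup I am trying to reproduce inside the
# class. The NCCL backend, MASTER_ADDR/MASTER_PORT values, and the toy
# model/dataset are assumptions for illustration only.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train_loop(rank, world_size, model, dataset):
    # Every spawned process joins the same process group before DDP wrapping.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # This process's copy of the model goes to its own GPU; DDP then keeps
    # gradients synchronized across ranks during backward().
    model.to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # A DistributedSampler gives each rank a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    for x, y in loader:
        x, y = x.to(rank), y.to(rank)
        loss = torch.nn.functional.mse_loss(ddp_model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # one process per GPU
    model = torch.nn.Linear(10, 1)
    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    torch.multiprocessing.spawn(train_loop, args=(world_size, model, dataset),
                                nprocs=world_size, join=True)
```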
Question:
- Is the approach correct? My understanding is that, unlike threading, multiprocessing spawns independent processes, and each of them gets its own copy of the trainer object. Is that correct? (See the small probe after this question for how I tried to check this.)
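To make the question concrete, here is the kind of minimal probe I would run to check that assumption; the dummy `step` attribute and the prints are only there for illustration:

```python
# Minimal probe of my mental model: each spawned process should get its own
# copy of the Trainer instance (same initial state, different pid), so
# mutations inside train_loop should not be visible in the parent process.
import os
import torch.multiprocessing as mp


class Trainer:
    def __init__(self):
        self.step = 0  # dummy state, just to see whether it is shared

    def train_loop(self, rank, world_size):
        self.step += 1  # mutates only this process's copy
        print(f"rank={rank} pid={os.getpid()} step={self.step}")

    def run(self, world_size=2):
        mp.spawn(self.train_loop, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    t = Trainer()
    t.run()
    # If the processes really are independent copies, this still prints 0.
    print(f"parent pid={os.getpid()} step={t.step}")
```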