Are instance variables shared or re-initialized in DDP mode?

I’m building a Trainer class that supports PyTorch DDP (multi-GPU training). The structure is like this:

class trainer:
    def __init__(self, ...):
        # initialize once here
        self.train_loader = ...
        self.model = ...
        self.optimizer = ...
    
    def train_loop(self, rank, world_size):
        # transfer to current rank
        self.model.to(rank)
        for data in self.train_loader:
            data = data.to(rank)

    def run(self):
        # spawn processes
        world_size = ...
        torch.multiprocessing.spawn(self.train_loop,
                                    args=(world_size,),
                                    nprocs=world_size, join=True)

The above code runs without errors. But running without errors doesn’t mean it’s doing what it’s supposed to do, especially with DDP.

Question:

  • Is the approach correct? I think that, unlike threading, multiprocessing will spawn independent processes and each of them will have its own copy of the trainer object. Is that correct?

Yes, you are right. For each rank you will get a completely independent instance of your class trainer, with all of its variables. Moreover, as a general rule, you won’t be able to share data or instances across ranks, since they live in different processes.
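
In case it helps, here is a minimal sketch of what each spawned process typically does with its own copy of the object: join the process group, pin itself to its GPU, and wrap its copy of the model in DistributedDataParallel so gradients are synchronized. The "nccl" backend and the MASTER_ADDR / MASTER_PORT values are assumptions for a single-node, multi-GPU run, not something taken from your code:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

    def train_loop(self, rank, world_size):
        # Each spawned process runs this with its own copy of `self`.
        # MASTER_ADDR / MASTER_PORT and the "nccl" backend are assumptions
        # for a single-node, multi-GPU setup; adjust for your environment.
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)  # bind this process to its own GPU

        self.model.to(rank)
        ddp_model = DDP(self.model, device_ids=[rank])  # syncs gradients across ranks

        for data in self.train_loader:
            data = data.to(rank)
            # forward / backward / optimizer.step() on ddp_model here

        dist.destroy_process_group()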


Thanks so much. I just wanted to confirm this. I was able to build the trainer, and DDP is working correctly, as I can verify the speedup. But there is one strange thing: there are extra rank 0 processes being created. Even with 4 GPUs, the rank 0 GPU gets 2 extra processes. Not sure why. Maybe you can think of something? Thanks!
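
For context, here is roughly how each worker starts on my side. The torch.cuda.set_device(rank) call and the explicit map_location are only guesses about what might matter, since I understand a stray tensor or checkpoint landing on the default cuda:0 device is a common way to end up with extra contexts on GPU 0:

    def train_loop(self, rank, world_size):
        # Guard against anything CUDA-related defaulting to cuda:0:
        # pin this process to its own device before any other CUDA call.
        torch.cuda.set_device(rank)

        self.model.to(rank)

        # If a checkpoint were loaded inside the worker, map it explicitly
        # (the filename below is just a placeholder, not from my actual code):
        # state = torch.load("checkpoint.pt", map_location=f"cuda:{rank}")

        for data in self.train_loader:
            data = data.to(rank)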