What does the DDP wrapper do before passing args into self.module?

In the newest version of PyTorch, I noticed that the DDP wrapper recursively converts all input tensors to CUDA tensors when using multi-GPU training. This behavior is mostly expected.
But recently I need some modules in my model to run in eval mode even during training (mainly for batch norm), so I just wrote:

# model has been wrapped by DDP
model.eval()            # puts every submodule into eval mode
model.training = True   # manually restore the top-level flag my forward code reads

Since batch norm is the only thing in my model that behaves differently between training and testing, the code above should work properly.
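
For reference, I know the same effect can also be written by keeping the whole model in train mode and flipping only the batch-norm submodules to eval; a minimal sketch of that pattern (model here is the DDP-wrapped model from above):

import torch.nn as nn

model.train()  # everything in train mode, so model.training == True
for m in model.modules():
    # freeze running-stat updates for the batch-norm layers only
    if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        m.eval()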

I manually set the flag model.training to True because some of my forward code depends on it. But every time I step into the training forward, this flag has automatically become False.
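
Roughly what I'm seeing (a minimal sketch; model and batch stand in for my real code):

model.eval()
model.training = True   # manual override
print(model.training)   # True, as expected

output = model(batch)   # but stepping into the training forward here with a
                        # debugger, the flag has already flipped back to False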
Since the DDP source code is complex, I just wonder: what does DDP do before passing the args (and kwargs) to the model? Are there any flags or behaviors not mentioned in the docs?

Thanks!

Hey @siesta, I don't recall DDP implicitly setting .training to False in forward. The code below is roughly what happens before calling forward on the original model.
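
(Simplified from torch/nn/parallel/distributed.py; the exact code varies across PyTorch versions, and I've trimmed it down to the relevant parts.)

def forward(self, *inputs, **kwargs):
    # 1) sync parameters/buffers across replicas if required
    if self.require_forward_param_sync:
        self._sync_params()

    if self.device_ids:
        # 2) scatter args and kwargs to the configured GPUs -- this is
        # the step that recursively moves input tensors to CUDA
        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
        if len(self.device_ids) == 1:
            # one GPU per process: call the wrapped module directly
            output = self.module(*inputs[0], **kwargs[0])
        else:
            # single-process multi-GPU: run the per-device replicas in parallel
            outputs = self.parallel_apply(
                self._module_copies[:len(inputs)], inputs, kwargs)
            output = self.gather(outputs, self.output_device)
    else:
        output = self.module(*inputs, **kwargs)

    # 3) tell the reducer which outputs participate in the backward pass
    if torch.is_grad_enabled() and self.require_backward_grad_sync:
        self.require_forward_param_sync = True
        if self.find_unused_parameters:
            self.reducer.prepare_for_backward(list(_find_tensors(output)))
        else:
            self.reducer.prepare_for_backward([])
    else:
        self.require_forward_param_sync = False

    return output

As far as I can see, nothing in there writes to .training.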

Is there a repro that we can dig into?

Thanks for the reply! I'm checking and rearranging my code. I'll reply to this thread once the repro is published, and also check whether the problem can be reproduced~