I tried DDP no_sync()
well…
I thought no_sync() would stop gradient communication within the DDP group, but as far as my experiments show, that is not always the case.
It shows two different behaviors across the following three cases.
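For reference, every snippet below assumes roughly this scaffolding (a minimal sketch: the real model, loss, and data are placeholders, and the script is launched with torchrun, one GPU per process):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Linear(10, 10).to(rank)
ddp_model = DDP(model, device_ids=[rank])
loss_fn = torch.nn.MSELoss()

for batch in range(8):
    # different data per rank, so a skipped all-reduce is observable
    model_in = torch.randn(4, 10, device=rank)
    label = torch.randn(4, 10, device=rank)
    # ... one of the three cases below goes here ...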
Case 1: forward and backward inside the same no_sync(), like below:

if batch % 2 == 0:
    # even batches: accumulate gradients locally, no all-reduce (hopefully)
    with ddp_model.no_sync():
        pred = ddp_model(model_in)
        loss = loss_fn(pred, label)
        loss.backward()
else:
    # odd batches: normal synchronized step
    pred = ddp_model(model_in)
    loss = loss_fn(pred, label)
    loss.backward()
It works as I hoped: on the even batches, the gradients simply accumulate locally, without synchronization.
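(For completeness, this is roughly how I'm checking "synchronized or not" after each backward. grads_match_across_ranks is a hypothetical helper I wrote for the experiment; it relies on each rank seeing different data:)

import torch
import torch.distributed as dist

def grads_match_across_ranks(ddp_model) -> bool:
    # Gather one parameter's gradient from every rank. If the all-reduce
    # ran, every rank holds the same averaged gradient; if no_sync() really
    # skipped it, ranks fed different data should disagree here.
    grad = next(ddp_model.parameters()).grad.detach().clone()
    gathered = [torch.empty_like(grad) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, grad)
    return all(torch.equal(gathered[0], g) for g in gathered)

In Case 1 this returns False right after the even batches, which is how I concluded nothing was communicated.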
Case 2: forward outside no_sync(), backward inside no_sync(), like below:

pred = ddp_model(model_in)  # forward always runs outside no_sync()
if batch % 2 == 0:
    with ddp_model.no_sync():
        loss = loss_fn(pred, label)
        loss.backward()
else:
    loss = loss_fn(pred, label)
    loss.backward()
Well… in this case, it synchronizes the gradients!! (Re-reading the docs, this much seems expected: the no_sync() docstring warns that the forward pass should be included inside the context manager, or else gradients will still be synchronized.)
Case 3: forward and backward each inside no_sync(), but in two different contexts, like below:

if batch % 2 == 0:
    with ddp_model.no_sync():  # first context: forward only
        pred = ddp_model(model_in)
    with ddp_model.no_sync():  # second context: loss and backward
        loss = loss_fn(pred, label)
        loss.backward()
else:
    pred = ddp_model(model_in)
    loss = loss_fn(pred, label)
    loss.backward()
In this case, it behaves like Case 2: it synchronizes!! That surprises me, because here the forward did run inside a no_sync() context.
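This is what confuses me. As far as I can tell from the source (my paraphrase of torch/nn/parallel/distributed.py below; details may differ across versions), no_sync() only toggles a flag, and I assumed DDP reads that flag during the forward pass to decide whether the upcoming backward should all-reduce:

from contextlib import contextmanager

# My paraphrase of DistributedDataParallel.no_sync(), a method of DDP;
# not the verbatim source.
@contextmanager
def no_sync(self):
    old = self.require_backward_grad_sync
    self.require_backward_grad_sync = False   # just a flag flip, no communication
    try:
        yield
    finally:
        self.require_backward_grad_sync = old  # restored when the context exits

If the flag were only consulted at forward time, I would expect Case 3 to behave like Case 1, but it doesn't in my runs.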
So: is there any way to force DDP not to synchronize gradients while decoupling the forward and backward passes?
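The only workaround I can think of is poking that internal flag directly around the forward alone (an untested sketch, and require_backward_grad_sync is not documented API, so I'd rather not depend on it):

# ASSUMPTION: require_backward_grad_sync is the internal flag that
# no_sync() toggles; it may change between PyTorch versions.
ddp_model.require_backward_grad_sync = False
pred = ddp_model(model_in)       # forward runs with syncing (hopefully) disarmed
ddp_model.require_backward_grad_sync = True

# ... arbitrary work between forward and backward ...

loss = loss_fn(pred, label)
loss.backward()                  # would this skip the all-reduce?

Given the Case 3 result above, though, I'm not confident even this works. Is there a supported way?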