What exactly does no_sync() do in DDP?

I tried using no_sync() with DDP.

I expected no_sync() to disable gradient communication across the DDP process group, but my experiments suggest that is not always the case.

Across the three cases below, I observed two different behaviors.

Case 1: forward and backward are in the same no_sync() context, as below

        if batch%2==0:
            with ddp_model.no_sync():
                pred = ddp_model(model_in)
                loss = loss_fn(pred, label)
                loss.backward()
        else:        
            pred = ddp_model(model_in)
            loss = loss_fn(pred, label)
            loss.backward()

This works as expected.
On even-numbered batches, gradients accumulate locally without synchronization.
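
The Case 1 pattern can be written without duplicating the forward/backward code in both branches. The sketch below is only an illustration: `sync_context` and `train_step` are hypothetical helper names, not part of PyTorch, and the only DDP call used is `ddp_model.no_sync()`. It assumes the same `loss_fn`, `model_in`, and `label` objects as the snippet above.

```python
from contextlib import nullcontext


def sync_context(ddp_model, should_sync):
    """Pick the context in which BOTH forward and backward will run.

    Hypothetical helper, not a PyTorch API. When should_sync is False,
    the DDP model's own no_sync() context manager is returned.
    """
    return nullcontext() if should_sync else ddp_model.no_sync()


def train_step(ddp_model, loss_fn, model_in, label, should_sync):
    # Forward and backward share one context, as in Case 1.
    with sync_context(ddp_model, should_sync):
        pred = ddp_model(model_in)
        loss = loss_fn(pred, label)
        loss.backward()
    return loss
```

Calling `train_step(ddp_model, loss_fn, model_in, label, should_sync=(batch % 2 != 0))` should reproduce the Case 1 behavior: no_sync() on even batches, normal synchronization on odd ones.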

Case 2: forward is outside no_sync() and backward is inside it, as below

        pred = ddp_model(model_in)
        if batch%2==0:
            with ddp_model.no_sync():
                loss = loss_fn(pred, label)
                loss.backward()
        else:        
            loss = loss_fn(pred, label)
            loss.backward()

In this case, the gradients are synchronized!

Case 3: forward and backward are both inside no_sync(), but in separate contexts, as below

        if batch%2==0:
            with ddp_model.no_sync():
                pred = ddp_model(model_in)
            with ddp_model.no_sync():
                loss = loss_fn(pred, label)
                loss.backward()
        else:        
            pred = ddp_model(model_in)
            loss = loss_fn(pred, label)
            loss.backward()

This case behaves like Case 2: the gradients are synchronized!

Is there any way to prevent gradient synchronization while keeping forward and backward decoupled?

Thanks for your feedback and question. This looks like a DDP-specific question. @rvarm1, do you mind taking a look?

First of all, thank you so much for your attention.

I found the relevant note in the source code on GitHub.

It says that the forward and backward passes should both be inside the context manager.

If I trace through the context manager function, I think I can find a hint.
If the implementation I want turns out to be possible, I will post a follow-up reply.
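
One possible direction, offered as an untested sketch rather than a confirmed solution: keep a single no_sync() context open from just before forward until just after backward, while calling the two steps from separate places. `DeferredNoSync` is a hypothetical helper name, not a PyTorch API; it uses only `contextlib.ExitStack` from the standard library and the model's `no_sync()`. Whether DDP treats this the same as Case 1 is an assumption that should be checked by comparing gradients across ranks.

```python
from contextlib import ExitStack


class DeferredNoSync:
    """Hypothetical helper (not a PyTorch API): hold one no_sync() context
    open from just before forward until just after backward."""

    def __init__(self, ddp_model):
        self.ddp_model = ddp_model
        self._stack = ExitStack()

    def forward(self, model_in, skip_sync):
        if skip_sync:
            # Enter no_sync() now; it stays open until backward() closes it.
            self._stack.enter_context(self.ddp_model.no_sync())
        return self.ddp_model(model_in)

    def backward(self, loss):
        try:
            loss.backward()
        finally:
            # Leave no_sync(); this is a no-op if it was never entered.
            self._stack.close()
```

A usage along the lines of `pred = step.forward(model_in, skip_sync=(batch % 2 == 0))` followed later by `step.backward(loss_fn(pred, label))` keeps forward and backward in separate calls while sharing one context.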