I tried DDP no_sync()
well…
I thought no_sync() would stop gradient communication within the DDP group, but as far as my experiments show, that is not always the case.
It shows two different behaviors across the following three cases.
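For reference, every snippet below assumes roughly this scaffolding (a minimal sketch: the real model, loss, and data are placeholders, and the script is launched with torchrun, one GPU per process):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Linear(10, 10).to(rank)
ddp_model = DDP(model, device_ids=[rank])
loss_fn = torch.nn.MSELoss()

for batch in range(8):
    # different data per rank, so a skipped all-reduce is observable
    model_in = torch.randn(4, 10, device=rank)
    label = torch.randn(4, 10, device=rank)
    # ... one of the three cases below goes here ...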
Case 1: forward and backward inside the same no_sync(), like below:

if batch % 2 == 0:
    # even batches: accumulate gradients locally, no all-reduce (hopefully)
    with ddp_model.no_sync():
        pred = ddp_model(model_in)
        loss = loss_fn(pred, label)
        loss.backward()
else:
    # odd batches: normal synchronized step
    pred = ddp_model(model_in)
    loss = loss_fn(pred, label)
    loss.backward()
It works as I hoped: on the even batches, the gradients simply accumulate locally, without synchronization.
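(For completeness, this is roughly how I'm checking "synchronized or not" after each backward. grads_match_across_ranks is a hypothetical helper I wrote for the experiment; it relies on each rank seeing different data:)

import torch
import torch.distributed as dist

def grads_match_across_ranks(ddp_model) -> bool:
    # Gather one parameter's gradient from every rank. If the all-reduce
    # ran, every rank holds the same averaged gradient; if no_sync() really
    # skipped it, ranks fed different data should disagree here.
    grad = next(ddp_model.parameters()).grad.detach().clone()
    gathered = [torch.empty_like(grad) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, grad)
    return all(torch.equal(gathered[0], g) for g in gathered)

In Case 1 this returns False right after the even batches, which is how I concluded nothing was communicated.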
Case 2: forward outside no_sync(), backward inside no_sync(), like below:

pred = ddp_model(model_in)  # forward always runs outside no_sync()
if batch % 2 == 0:
    with ddp_model.no_sync():
        loss = loss_fn(pred, label)
        loss.backward()
else:
    loss = loss_fn(pred, label)
    loss.backward()
Well… in this case, it synchronizes the gradients!! (Re-reading the docs, this much seems expected: the no_sync() docstring warns that the forward pass should be included inside the context manager, or else gradients will still be synchronized.)
Case 3: forward and backward each inside no_sync(), but in two different contexts, like below:

if batch % 2 == 0:
    with ddp_model.no_sync():  # first context: forward only
        pred = ddp_model(model_in)
    with ddp_model.no_sync():  # second context: loss and backward
        loss = loss_fn(pred, label)
        loss.backward()
else:
    pred = ddp_model(model_in)
    loss = loss_fn(pred, label)
    loss.backward()
In this case, it behaves like Case 2: it synchronizes!! That surprises me, because here the forward did run inside a no_sync() context.
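This is what confuses me. As far as I can tell from the source (my paraphrase of torch/nn/parallel/distributed.py below; details may differ across versions), no_sync() only toggles a flag, and I assumed DDP reads that flag during the forward pass to decide whether the upcoming backward should all-reduce:

from contextlib import contextmanager

# My paraphrase of DistributedDataParallel.no_sync(), a method of DDP;
# not the verbatim source.
@contextmanager
def no_sync(self):
    old = self.require_backward_grad_sync
    self.require_backward_grad_sync = False   # just a flag flip, no communication
    try:
        yield
    finally:
        self.require_backward_grad_sync = old  # restored when the context exits

If the flag were only consulted at forward time, I would expect Case 3 to behave like Case 1, but it doesn't in my runs.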
So: is there any way to force DDP not to synchronize gradients while decoupling the forward and backward passes?
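The only workaround I can think of is poking that internal flag directly around the forward alone (an untested sketch, and require_backward_grad_sync is not documented API, so I'd rather not depend on it):

# ASSUMPTION: require_backward_grad_sync is the internal flag that
# no_sync() toggles; it may change between PyTorch versions.
ddp_model.require_backward_grad_sync = False
pred = ddp_model(model_in)       # forward runs with syncing (hopefully) disarmed
ddp_model.require_backward_grad_sync = True

# ... arbitrary work between forward and backward ...

loss = loss_fn(pred, label)
loss.backward()                  # would this skip the all-reduce?

Given the Case 3 result above, though, I'm not confident even this works. Is there a supported way?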