What is the best practice for running distributed adversarial training?

I have opened a discussion here about a similar question regarding two DDP modules in a GAN setting: Calling DistributedDataParallel on multiple Modules?. I'm still trying to determine whether a single process group can suffice, but the safest course of action seems to be using separate groups for G and D.
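For reference, here is a minimal sketch of what I mean by separate groups: each model gets its own process group so their gradient all-reduces use independent communicators. This is my own illustration, not confirmed best practice; `Generator` and `Discriminator` are placeholder module classes, and it assumes the usual one-GPU-per-process setup.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# new_group() is a collective: every process must call it, in the same order.
# Both groups span all ranks here; they are simply independent communicators.
world = list(range(dist.get_world_size()))
g_group = dist.new_group(ranks=world)
d_group = dist.new_group(ranks=world)

# Wrap G and D in separate DDP instances, each tied to its own group.
G = DDP(Generator().cuda(rank), device_ids=[rank], process_group=g_group)
D = DDP(Discriminator().cuda(rank), device_ids=[rank], process_group=d_group)
```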

Regarding setting requires_grad to False on D while backpropagating G's loss: I have been meaning to implement the same thing but never got around to it. It seems like the logical approach, since computing gradients for D's parameters is wasted work when they are just going to be discarded.
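Something along these lines is what I had in mind (untested sketch; `G`, `D`, `opt_G`, `criterion`, `z`, and `real_labels` are assumed to already exist). Backward still flows through D's activations to reach G, but D's parameter gradients are skipped. Note that if D is wrapped in DDP, toggling requires_grad after construction may need extra care, since the reducer registers hooks based on the state at wrap time.

```python
def set_requires_grad(model, flag: bool):
    # Toggle gradient computation for all of a model's parameters.
    for p in model.parameters():
        p.requires_grad_(flag)

# Generator step: freeze D so backward() does not compute or store
# gradients for D's parameters (they would only be discarded anyway).
set_requires_grad(D, False)
opt_G.zero_grad()
g_loss = criterion(D(G(z)), real_labels)
g_loss.backward()
opt_G.step()
set_requires_grad(D, True)  # unfreeze before the next discriminator step
```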