Knowledge distillation with DDP

This post is basically asking a question that has come up before.

In a distributed data parallel setup, we use a frozen teacher model to regularize the training of a student. The student has to be wrapped in the DDP wrapper, but what about the teacher?

Could anyone explain why the teacher does or does not need to be DDP-wrapped?

Note that the purpose of wrapping a model with DDP is to keep each replica (e.g., one per GPU) in sync during training: after every replica computes gradients on its local batch, DDP all-reduces them so that all replicas take identical optimizer steps (and if synchronized normalization layers such as SyncBatchNorm are used, their running statistics are also shared across replicas). Since the teacher is frozen, it receives no gradient updates and its normalization statistics never change, so each rank can simply keep its own local copy of the teacher and run it in inference mode; wrapping it with DDP wouldn't be necessary.
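To make this concrete, here's a minimal sketch of a distillation step under DDP: the student gets the wrapper, the teacher stays a plain per-rank module. The `setup_models`/`distill_step` helpers, the temperature `T`, and the weighting `alpha` are hypothetical names, and it assumes the process group has already been initialized (e.g., via `torchrun`) and that `local_rank` is known:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_models(student: nn.Module, teacher: nn.Module, local_rank: int):
    device = torch.device(f"cuda:{local_rank}")

    # Student is being trained, so it gets the DDP wrapper for gradient sync.
    student = DDP(student.to(device), device_ids=[local_rank])

    # Teacher is frozen: no gradient updates and no normalization-statistic
    # updates, so a plain local copy on each rank is enough.
    teacher = teacher.to(device)
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    return student, teacher

def distill_step(student, teacher, x, y, optimizer, T=4.0, alpha=0.5):
    optimizer.zero_grad()
    student_logits = student(x)
    with torch.no_grad():          # teacher runs inference only
        teacher_logits = teacher(x)

    # Soft-target KD loss plus the usual hard-label cross-entropy.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, y)
    loss = alpha * kd + (1 - alpha) * ce

    loss.backward()                # DDP all-reduces the student's gradients here
    optimizer.step()
    return loss.item()
```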

I see. Thank you for your detailed reply! @eqy

As a follow-up question, how about DataParallel? Would it be necessary to wrap the teacher with torch.nn.DataParallel?

I don't think so. DataParallel is effectively a more limited, single-process version of DDP, so for the same reasons it wouldn't be necessary for the teacher either.
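For completeness, a small sketch of the same idea with DataParallel: only the student is wrapped, and the teacher stays a plain module on the primary device. The toy `nn.Linear` models and the shapes are placeholders, assuming a single-process multi-GPU host:

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0")

# Only the student needs the wrapper; DataParallel scatters each batch
# across the visible GPUs and gathers the outputs back onto cuda:0.
student = nn.DataParallel(nn.Linear(128, 10).to(device))

# Frozen teacher: plain module on one device, inference only.
teacher = nn.Linear(128, 10).to(device).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

x = torch.randn(64, 128, device=device)
with torch.no_grad():
    teacher_logits = teacher(x)   # full batch runs on a single device
student_logits = student(x)       # student is replicated across GPUs per forward
```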