Question about the usage of DistributedDataParallel

Ford · March 3, 2022, 10:55am

I am trying to improve the performance of my model by introducing a pre-trained teacher model. I have tried several ways to implement this, but none of them has worked, so I am asking for help here.

I use two models: a pre-trained teacher model T and a student model S. The features produced by the intermediate layers of S are supervised by the features produced by the corresponding layers of T. In other words, the loss function is the distance between the feature maps of S and T. The teacher model is frozen, while the student model is trained on 8 cards using DistributedDataParallel, i.e., torch.nn.parallel.DistributedDataParallel(S). The question is how to handle the teacher T so that the student replicas on every card are trained against the same teacher model. Do I need to apply torch.nn.parallel.DistributedDataParallel() to T? Can the trained teacher model in evaluation mode, i.e. T.eval(), be distributed across multiple cards? Thanks.

pritamdamania87 · March 4, 2022, 3:46am

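Since the teacher model is frozen, you don’t need to use torch.nn.parallel.DistributedDataParallel for it: DistributedDataParallel exists to synchronize gradients across processes, and a frozen model has none to synchronize. As long as every process loads the same teacher checkpoint, every student replica sees the same teacher. You can do the following on each process: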

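import torch

T = torch.load(...)  # load the teacher model; it stays a plain, unwrapped module on each process
S = torch.load(...)  # load the student model
ddp_S = torch.nn.parallel.DistributedDataParallel(S)  # wrap only the student, which is the model being trained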

# Now you can use T and ddp_S for your training.
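
To make the last step concrete, here is a minimal sketch of the training step each process might run. Everything in it is an assumption for illustration: the helper names `freeze` and `distill_step` are hypothetical, the two `nn.Linear` layers stand in for the real teacher and student, the model outputs stand in for the intermediate feature maps, and MSE stands in for whatever feature distance is actually used. The DistributedDataParallel wrapping is left as a comment because it needs an initialized process group; the step function itself should work the same with a plain or a wrapped student.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def freeze(teacher: nn.Module) -> nn.Module:
    """Put the teacher in eval mode and turn off gradients for its parameters."""
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher


def distill_step(teacher, student, x, optimizer):
    """One distillation step; `student` may be a plain or a DDP-wrapped module."""
    with torch.no_grad():            # no autograd graph is built for the teacher
        target = teacher(x)
    pred = student(x)
    loss = F.mse_loss(pred, target)  # assumed distance between student and teacher features
    optimizer.zero_grad()
    loss.backward()                  # with a DDP-wrapped student, gradients are all-reduced here
    optimizer.step()
    return loss.item()


# Hypothetical stand-ins for the real teacher and student models.
torch.manual_seed(0)
T = freeze(nn.Linear(4, 2))
S = nn.Linear(4, 2)
# In the multi-GPU run, each process would wrap only the student, for example:
#   S = torch.nn.parallel.DistributedDataParallel(S.to(rank), device_ids=[rank])
optimizer = torch.optim.SGD(S.parameters(), lr=0.1)

x = torch.randn(8, 4)                # one toy batch
losses = [distill_step(T, S, x, optimizer) for _ in range(20)]
```

In a real run the teacher would also need to be moved to the same device as the student replica on that process, and only the student's parameters are handed to the optimizer.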