I am trying to improve the performance of my model by introducing a pre-trained teacher model. I have tried many ways to implement this, but none of them has worked, so I am here to ask for help.
I use two models: a pre-trained teacher model T and a student model S. The features generated by the intermediate layers of S are supervised by the features generated by the corresponding layers of T. In other words, the loss function is the distance between the feature maps of S and T. The teacher model is frozen, while the student model is trained on 8 cards using DistributedDataParallel.
The question is how to handle the teacher T so that the student replicas on every card are trained against the same teacher model. Do I need to apply torch.nn.parallel.DistributedDataParallel() to T? Can the trained teacher model in evaluation mode, i.e. T.eval(), be distributed to multiple cards? Thanks.
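To make the setup concrete, here is a minimal sketch of what I have in mind, not a confirmed solution. It assumes each process loads the same teacher checkpoint onto its own card, freezes it, and leaves it unwrapped, since a frozen model has no gradients to synchronize; only S is wrapped in DistributedDataParallel. The names `TinyNet`, `freeze_teacher`, `feature_distillation_loss`, `train_step`, `setup_models` and the `teacher_ckpt` path are hypothetical, and MSE stands in for whatever feature distance is actually used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP


class TinyNet(nn.Module):
    """Hypothetical stand-in for S and T; returns intermediate feature maps."""

    def __init__(self):
        super().__init__()
        self.block1 = nn.Conv2d(3, 8, 3, padding=1)
        self.block2 = nn.Conv2d(8, 8, 3, padding=1)

    def forward(self, x):
        f1 = F.relu(self.block1(x))
        f2 = F.relu(self.block2(f1))
        return [f1, f2]


def freeze_teacher(teacher):
    """Disable gradients and switch to eval mode (fixes BatchNorm/Dropout behavior)."""
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher.eval()


def feature_distillation_loss(student_feats, teacher_feats):
    """Sum of per-layer MSE distances (assumed distance; any feature metric fits here)."""
    return sum(F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats))


def train_step(student, teacher, images, optimizer):
    """One optimization step; the teacher forward pass builds no autograd graph."""
    with torch.no_grad():
        teacher_feats = teacher(images)
    student_feats = student(images)
    loss = feature_distillation_loss(student_feats, teacher_feats)
    optimizer.zero_grad()
    loss.backward()  # with a DDP-wrapped student, gradients are all-reduced here
    optimizer.step()
    return loss.item()


def setup_models(local_rank, teacher_ckpt):
    """Per-process setup; assumes torch.distributed is already initialized."""
    device = torch.device("cuda", local_rank)
    teacher = TinyNet()
    # Every rank loads the same checkpoint, so all cards see identical teacher weights.
    teacher.load_state_dict(torch.load(teacher_ckpt, map_location="cpu"))
    teacher = freeze_teacher(teacher).to(device)  # plain module, not wrapped in DDP
    student = DDP(TinyNet().to(device), device_ids=[local_rank])
    return student, teacher
```

In this sketch `setup_models` would be called once per process with that process's local rank, and `train_step` would be called inside the usual loop over a `DistributedSampler`-backed loader. If the training loop calls `student.train()` on a container that also holds T, T would need `eval()` called again afterwards.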