I am trying to improve the performance of my model by introducing a pre-trained teacher model. I have tried several ways to implement it, but none of them has been successful, so I am asking for help here.

I have two models: a pre-trained teacher model **T** and a student model **S**. The features generated by the intermediate layers of **S** are supervised by the features generated by the corresponding layers of **T**; in other words, the loss function is the distance between the feature maps of **S** and **T**. The teacher model is frozen, while the student model is trained on 8 cards using DistributedDataParallel, *i.e.*, `torch.nn.parallel.DistributedDataParallel(S)`.

My question is: how should the teacher **T** be handled so that the student replicas on all cards are trained against the same teacher? Do I need to wrap **T** in `torch.nn.parallel.DistributedDataParallel()` as well? Can a trained teacher model in evaluation mode, *i.e.* **T**.eval(), be replicated across multiple cards? Thanks.
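For context, here is a minimal sketch of what I am doing in each process (the helper names `make_teacher` and `distillation_loss` are just placeholders I made up, not part of any API):

```python
import torch
import torch.nn as nn

def make_teacher(model: nn.Module) -> nn.Module:
    """Prepare a frozen teacher for feature distillation.

    Each DDP process would call this on its own copy of the teacher;
    the teacher is replicated per process but never wrapped in
    DistributedDataParallel, since it produces no gradients to sync.
    """
    model.eval()                 # fix BatchNorm/Dropout behavior
    for p in model.parameters():
        p.requires_grad_(False)  # exclude teacher from autograd
    return model

def distillation_loss(student_feats, teacher_feats):
    """Sum of L2 distances between corresponding feature maps."""
    return sum(nn.functional.mse_loss(s, t)
               for s, t in zip(student_feats, teacher_feats))
```

In the training loop I run the teacher forward under `torch.no_grad()` to save memory, and only the student is wrapped in `DistributedDataParallel`.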