What is the appropriate way to initialize a teacher model for distillation?

What I want to do:

  1. load a teacher model,
  2. during training, do a forward pass with the teacher model and compute an L2 loss against the student model's output, using that loss to update only the student model

What I did:

  1. constructed the teacher model inside a torch.no_grad() block

My question:

  1. Do I still need to set requires_grad = False on all teacher parameters to make sure the teacher model is frozen and no backpropagation happens through it?
  2. Or should I put output = teacher_model(input) inside a torch.no_grad() block too?
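For concreteness, here's a minimal sketch of the setup I'm asking about, with toy nn.Linear models standing in for the real teacher/student networks. It combines both options (requires_grad = False on the teacher plus wrapping the teacher forward in torch.no_grad()):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real teacher and student networks.
teacher_model = nn.Linear(4, 2)
student_model = nn.Linear(4, 2)

# Freeze the teacher: eval mode, and disable grad on its parameters.
teacher_model.eval()
for p in teacher_model.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.SGD(student_model.parameters(), lr=0.1)

x = torch.randn(8, 4)

# Teacher forward under no_grad, so no autograd graph is built for it.
with torch.no_grad():
    teacher_out = teacher_model(x)

# Student forward + L2 (MSE) loss against the teacher's output.
student_out = student_model(x)
loss = nn.functional.mse_loss(student_out, teacher_out)

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Only the student accumulated gradients; the teacher stayed untouched.
assert all(p.grad is None for p in teacher_model.parameters())
assert all(p.grad is not None for p in student_model.parameters())
```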