DDP: Update teacher parameters from student parameters

Hello,

I am currently training a teacher and a student model for unsupervised domain adaptation with pseudo labels. The teacher model is an exponential moving average (EMA) of the student model and is updated from the student weights after each backpropagation step of the student. The teacher's weights should never be updated by gradients; the teacher is only used to predict the detached pseudo labels. I get the following error:


> RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
> Parameter at index 433 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

To create the teacher network, I generate a checkpoint of the student model and then load its state dict into the teacher model. This part works, using the following function:

    def create_ema_model_copy(self):
        """
        Creates an exponential moving average copy of the model in order to predict the pseudo labels. (Using the
        same model for training and for the predictions produces a high bias and results in a collapse of semantic
        classes in the prediction.)
        """
        self.ema_model = models.get_model(self.cfg.model.type, self.cfg)
        if self.cfg.device.multiple_gpus:
            self.ema_model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(self.ema_model)
        self.ema_model.to(self.device)

        if self.cfg.device.multiple_gpus:
            self.ema_model = CustomDistributedDataParallel(self.ema_model,
                                                           device_ids=[self.rank],
                                                           find_unused_parameters=True)

        # Load the current student weights into the teacher via a checkpoint.
        checkpoint = io_utils.IOHandler.gen_checkpoint(
            self.model.get_networks(), **{"cfg": self.cfg})

        io_utils.IOHandler.load_weights(checkpoint, self.ema_model.get_networks())

        # The teacher must never receive gradients, so detach all of its parameters in place.
        for param in self.ema_model.parameters():
            param.detach_()

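For context, stripped of the project-specific helpers, the intent is just to clone the student into a frozen teacher. A minimal sketch of that idea on a plain (non-DDP-wrapped) module; the names here are only for illustration:

    import copy
    import torch

    def create_ema_copy(model: torch.nn.Module) -> torch.nn.Module:
        # Clone the student's architecture together with its current weights.
        ema_model = copy.deepcopy(model)
        # The teacher never receives gradient updates, so freeze all parameters.
        for p in ema_model.parameters():
            p.requires_grad_(False)
        ema_model.eval()
        return ema_model
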
To update the teacher network, I call the following function after `loss.backward()`:

    def update_ema_model_copy(self, iteration, teacher_weight):
        """Copied and modified from https://github.com/vikolss/DACS/blob/master/trainUDA.py"""
        teacher_weight = min(1 - 1 / (iteration + 1), teacher_weight)
        for ema_param, param in zip(self.ema_model.parameters(), self.model.parameters()):
            ema_param.data[:] = teacher_weight * ema_param[:].data[:] + (1 - teacher_weight) * param[:].data[:]

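For reference, the same EMA update, `teacher = alpha * teacher + (1 - alpha) * student`, can also be written with in-place ops under `torch.no_grad()` so that no autograd operations are recorded on the parameters. This is only a sketch; whether it interacts better with the DDP hooks is exactly what I am unsure about:

    @torch.no_grad()
    def update_ema_model_copy(self, iteration, teacher_weight):
        # Ramp up the EMA decay over the first iterations, capped at teacher_weight.
        teacher_weight = min(1 - 1 / (iteration + 1), teacher_weight)
        for ema_param, param in zip(self.ema_model.parameters(), self.model.parameters()):
            # In place: teacher = alpha * teacher + (1 - alpha) * student
            ema_param.mul_(teacher_weight).add_(param, alpha=1 - teacher_weight)
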
How can I use the student model weights to update the teacher model?

@blowtorch Have you tried calling `_set_static_graph()` as suggested in the error message?
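
If the graph really is static across iterations, that should be a one-line change right after constructing the DDP wrapper, roughly like this (`_set_static_graph()` is a private API, so it may differ between PyTorch versions):

    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[rank],
                                                      find_unused_parameters=True)
    # Tell DDP that the set of used parameters does not change between iterations.
    model._set_static_graph()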