Model and DDP-wrapped model

valiantljk · June 2, 2021, 2:13am

Hi, I’m using AllenNLP to do distributed BERT training.
In their code, the model has some custom methods, e.g., get_metrics and get_regularization_penalty. Where the model gets wrapped with DDP, there is a comment that says:

        # Using `DistributedDataParallel`(ddp) brings in a quirk wrt AllenNLP's `Model` interface and its
        # usage. A `Model` object is wrapped by `ddp`, but assigning the wrapped model to `self.model`
        # will break the usages such as `Model.get_regularization_penalty`, `Model.get_metrics`, etc.
        #
        # Hence a reference to Pytorch's object is maintained in the case of distributed training and in the
        # normal case, reference to `Model` is retained. This reference is only used in
        # these places: `model.__call__`, `model.train` and `model.eval`.

github.com

allenai/allennlp/blob/c5bff8ba0d835eb03931f10f4f427ffe936cf796/allennlp/training/gradient_descent_trainer.py#L302

    
      
    self._num_gradient_accumulation_steps = num_gradient_accumulation_steps

    # Enable automatic mixed precision training.
    self._scaler: Optional[amp.GradScaler] = None
    self._use_amp = use_amp
    if self._use_amp:
        if self.cuda_device == torch.device("cpu"):
            raise ValueError("Using AMP requires a cuda device")
        self._scaler = amp.GradScaler()

    # Using `DistributedDataParallel`(ddp) brings in a quirk wrt AllenNLP's `Model` interface and its
    # usage. A `Model` object is wrapped by `ddp`, but assigning the wrapped model to `self.model`
    # will break the usages such as `Model.get_regularization_penalty`, `Model.get_metrics`, etc.
    #
    # Hence a reference to Pytorch's object is maintained in the case of distributed training and in the
    # normal case, reference to `Model` is retained. This reference is only used in
    # these places: `model.__call__`, `model.train` and `model.eval`.
    if self._distributed:
        self._pytorch_model = DistributedDataParallel(
            self.model,
            device_ids=None if self.cuda_device == torch.device("cpu") else [self.cuda_device],

My question is: what is the relationship between self.model and its wrapped version, self._pytorch_model?

Do they share parameters and runtime state?

tom · June 2, 2021, 5:23pm

You have one object for each of the three classes:

m = ThePyTorchModel (without DDP)
ddp_m = DistributedDataParallel(ThePyTorchModel)
anlp_m = Model(ThePyTorchModel) (AllenNLP’s model class)

ddp_m and anlp_m both wrap (i.e. contain a reference to) the same instance m, usually as .module and .model respectively.

Now AllenNLP doesn’t want to special-case this and write .model.module if isinstance(.model, DDP) else .model all the time, so it leaves .model as the regular model m and stores ddp_m separately as ._pytorch_model.

So anlp_m._pytorch_model.module is anlp_m.model should return True: they are indeed the very same object.

To your question about shared parameters and runtime state: yes, in the above sense (you just have an additional level of hierarchy when going through DDP).
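For anyone who wants to check this locally, here is a minimal sketch, not AllenNLP's code. It assumes a single-process gloo group on CPU, and TinyModel, its get_metrics method, and check_sharing are made-up names for illustration. It checks that the DDP wrapper holds the very same module under .module, that the parameters share storage, and that custom methods are not forwarded by the wrapper (which is why the trainer keeps both references).

```python
import os
import shutil
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


class TinyModel(torch.nn.Module):
    """Hypothetical stand-in for a model with a custom method."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)

    def get_metrics(self):
        # Custom method that DDP knows nothing about.
        return {"dummy": 0.0}


def check_sharing():
    # Single-process group on CPU, only so that DDP can be constructed.
    tmp_dir = tempfile.mkdtemp()
    init_file = os.path.join(tmp_dir, "ddp_init")
    dist.init_process_group(
        "gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )
    try:
        m = TinyModel()
        ddp_m = DistributedDataParallel(m)

        # The wrapper stores the original module itself, not a copy.
        same_object = ddp_m.module is m
        # Parameters seen through the wrapper use the same memory.
        same_storage = all(
            p.data_ptr() == q.data_ptr()
            for p, q in zip(m.parameters(), ddp_m.parameters())
        )
        # Custom methods are not reachable on the wrapper directly.
        wrapper_has_custom = hasattr(ddp_m, "get_metrics")
        return same_object, same_storage, wrapper_has_custom
    finally:
        dist.destroy_process_group()
        shutil.rmtree(tmp_dir, ignore_errors=True)


if __name__ == "__main__":
    print(check_sharing())
```

Calling ddp_m.module.get_metrics() would still work, since .module is the original model.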

Best regards

Thomas

valiantljk · June 2, 2021, 8:37pm

Thanks @tom. Very clear now.
A follow-up question: when the DDP-wrapped model is copied onto one CUDA device, does the original model still hold the reference to their common ancestor?

tom · June 2, 2021, 8:41pm

Yes, I think so. ← Service smile to reach the minimum length.
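A rough way to see why, assuming "copied onto a device" means calling .to(...) on the wrapper: torch.nn.Module.to converts the module in place and returns the same object, so every reference to the inner module keeps pointing at it. In the sketch below torch.nn.Sequential is only a stand-in for the DDP wrapper (so no process group is needed), move_and_compare is a made-up helper name, and a dtype change is added so that something visibly changes on a CPU-only machine too.

```python
import torch


def move_and_compare():
    model = torch.nn.Linear(4, 2)
    # Stand-in wrapper (an assumption for illustration, not DDP itself).
    wrapper = torch.nn.Sequential(model)

    target = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Module.to() converts parameters in place and returns the module itself.
    returned = wrapper.to(device=target, dtype=torch.float64)

    return (
        returned is wrapper,                       # no new wrapper object
        wrapper[0] is model,                       # still the same inner module
        model.weight.dtype == torch.float64,       # old reference sees the change
        model.weight.device.type == target.type,   # and the new device
    )


if __name__ == "__main__":
    print(move_and_compare())
```

This differs from tensors, where .to(...) returns a new tensor when a conversion is needed.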