Model and DDP-wrapped model

Hi, I’m using AllenNLP for distributed BERT training.
In its code, the model has some custom methods, e.g., get_metrics and get_regularization_penalty. Where the model is wrapped with DDP, there is a comment that says:

        # Using `DistributedDataParallel`(ddp) brings in a quirk wrt AllenNLP's `Model` interface and its
        # usage. A `Model` object is wrapped by `ddp`, but assigning the wrapped model to `self.model`
        # will break the usages such as `Model.get_regularization_penalty`, `Model.get_metrics`, etc.
        #
        # Hence a reference to Pytorch's object is maintained in the case of distributed training and in the
        # normal case, reference to `Model` is retained. This reference is only used in
        # these places: `model.__call__`, `model.train` and `model.eval`.

My question is: what is the relationship between self.model and its wrapped version, self._pytorch_model?

Do they share parameters and runtime state?

You have one object for each of the three classes:

  1. m = ThePyTorchModel (without DDP)
  2. ddp_m = DistributedDataParallel(ThePyTorchModel)
  3. anlp_m = Model(ThePyTorchModel) (AllenNLP’s model class)

ddp_m and anlp_m both wrap (i.e. contain a reference to) the same instance m, usually as .module and .model, respectively.

Now, AllenNLP doesn’t want to special-case this and write .model.module if isinstance(.model, DDP) else .model all the time. So it leaves .model as the regular model m and stores ddp_m separately as ._pytorch_model.

So anlp_m._pytorch_model.module is anlp_m.model should return True: they are indeed the very same object.

Yes, they share parameters and runtime state in the above sense (you just have an additional level of hierarchy when going through DDP).
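To make the sharing concrete, here is a minimal sketch in plain Python. It deliberately does not use torch or AllenNLP: TinyModel, DDPLike and TrainerLike are hypothetical stand-ins I made up for m, ddp_m and anlp_m, and the only thing they model is that a wrapper keeps a reference to the wrapped object rather than a copy. The real classes do much more than this.

```python
class TinyModel:
    """Hypothetical stand-in for the underlying model `m` (not a torch class)."""

    def __init__(self):
        self.weights = [0.0, 0.0]  # plays the role of the parameters
        self.training = True       # plays the role of runtime state

    def get_metrics(self):
        # Custom method that only exists on the unwrapped model.
        return {"weight_sum": sum(self.weights)}


class DDPLike:
    """Hypothetical stand-in for a DDP wrapper: keeps the model as `.module`."""

    def __init__(self, module):
        self.module = module  # a reference to the same object, not a copy


class TrainerLike:
    """Hypothetical stand-in for the object holding both references."""

    def __init__(self, model, distributed):
        self.model = model
        # Wrapped in the distributed case, the plain model otherwise.
        self._pytorch_model = DDPLike(model) if distributed else model


m = TinyModel()
anlp_m = TrainerLike(m, distributed=True)

# Both paths lead to the very same object.
same_object = anlp_m._pytorch_model.module is anlp_m.model

# A change made through the wrapper path ...
anlp_m._pytorch_model.module.weights[0] = 1.5
anlp_m._pytorch_model.module.training = False

# ... is visible through the unwrapped path, custom methods included.
metrics = anlp_m.model.get_metrics()
```

Because there is only one underlying object, custom methods such as get_metrics can be called on .model directly, while the wrapper path only adds one extra attribute hop.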

Best regards

Thomas


Thanks @tom. Very clear now.
A follow-up question: when the DDP-wrapped model is copied onto a CUDA device, does the original model still hold a reference to their common ancestor?

Yes, I think so.
