Can you perform testing while the model is in DistributedDataParallel mode?

txcode · December 23, 2021, 6:03pm

I wanted to ask, does utilizing DistributedDataParallel impact validation/testing? I’m wondering if it is still possible to “pause” after each epoch and test the model.

Does DistributedDataParallel prevent testing the model during the training loop (after an epoch)?

Since the data and model replicas are spread across different devices, I was unsure whether testing the model would cause problems.

Also, if testing is possible, should it run on only a single node, or on all training processes? How should the testing loop be incorporated?

For reference, the docs are shown here.

mrshenli · January 4, 2022, 4:05am

https://github.com/pytorch/pytorch/blob/d9106116aa5e399f7d63feeb7fc77f92a076dd93/torch/nn/parallel/distributed.py#L939-L949

          self.require_forward_param_sync = True
          # We'll return the output object verbatim since it is a freeform
          # object. We need to find any tensors in this object, though,
          # because we need to figure out which parameters were used during
          # this forward pass, to ensure we short circuit reduction for any
          # unused parameters. Only if `find_unused_parameters` is set.
          if self.find_unused_parameters and not self.static_graph:
              # Do not need to populate this for static graph.
              self.reducer.prepare_for_backward(list(_find_tensors(output)))
          else:
              self.reducer.prepare_for_backward([])

The code above makes DistributedDataParallel expect a backward pass after the current forward pass. However, you can run the validation/testing forward pass inside a with torch.no_grad(): context, which forces execution into a different branch so that no backward pass is expected.
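
To make the suggestion concrete, here is a minimal sketch of validating after each epoch while the model stays wrapped in DistributedDataParallel. It is only an illustration under stated assumptions: the validate and run_demo helpers, the tiny nn.Linear model, the random data, and the single-process gloo group backed by an in-memory HashStore are all made up for the example, and a real job would normally be launched with one process per device. Running validation on every rank and averaging the loss with all_reduce is one possible choice, not the only one.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def validate(ddp_model, inputs, targets):
    """Hypothetical helper: forward-only pass on this rank's validation batch."""
    ddp_model.eval()
    with torch.no_grad():  # gradients disabled, so no backward pass is expected
        loss = nn.functional.mse_loss(ddp_model(inputs), targets)
    ddp_model.train()  # restore training mode before the next epoch
    return loss


def run_demo(num_epochs=2):
    # Assumption: one CPU process with the gloo backend, only so the sketch
    # runs anywhere; a real job would have one process per device.
    dist.init_process_group(
        backend="gloo", store=dist.HashStore(), rank=0, world_size=1
    )
    try:
        torch.manual_seed(0)
        ddp_model = DDP(nn.Linear(4, 1))
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
        x, y = torch.randn(8, 4), torch.randn(8, 1)  # placeholder data

        val_losses = []
        for _ in range(num_epochs):
            # Ordinary training step: forward, backward, optimizer update.
            optimizer.zero_grad()
            nn.functional.mse_loss(ddp_model(x), y).backward()
            optimizer.step()

            # "Pause" after the epoch and validate without a backward pass.
            val_loss = validate(ddp_model, x, y)

            # Sum the per-rank losses, then divide by the number of ranks
            # so every process ends up with the same average.
            dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)
            val_losses.append(val_loss.item() / dist.get_world_size())
        return val_losses
    finally:
        dist.destroy_process_group()
```

Calling run_demo() returns one validation loss per epoch. Training resumes normally in the second epoch, after the no_grad validation pass of the first.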