Clarifying discrepancies between DataParallel and DistributedDataParallel

Hi folks,
I’ve seen in the example here that for DataParallel one has to follow these steps (sketched below):

  1. Create model
  2. Wrap the model with DataParallel
  3. Send to device

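For reference, here is roughly what that looks like in code. This is just a minimal sketch; the linear layer is a placeholder standing in for the real model:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(128, 10)      # 1. create model (placeholder module)
model = nn.DataParallel(model)  # 2. wrap the model with DataParallel
model = model.to(device)        # 3. send to device
```
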
This led to my first point of confusion: when I tried DistributedDataParallel by following the PyTorch ImageNet example, steps 2 and 3 are actually in the reverse order (the model is sent to the device first and then wrapped), and without that order it throws an error about dense CUDA tensors.
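
Concretely, this is the order I ended up with. A minimal sketch, assuming the script is launched with torchrun (which sets LOCAL_RANK) and the NCCL backend; the linear layer is again just a placeholder:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# assumes launch via torchrun, e.g.:
#   torchrun --nproc_per_node=NUM_GPUS train.py
dist.init_process_group(backend="nccl", init_method="env://")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(128, 10)           # 1. create model (placeholder)
model = model.cuda(local_rank)       # 3. send to device FIRST ...
model = nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank])  # 2. ... and only then wrap with DDP
```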

My second point of confusion also comes from the PyTorch ImageNet example of DistributedDataParallel usage, which uses torch.utils.data.distributed.DistributedSampler to wrap the train_set but not the validation and test sets. Why is that?
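
In other words, something like the following. This is only a sketch with dummy tensors standing in for the real splits, and it assumes the process group from the snippet above has already been initialised:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# dummy datasets standing in for the real train/val splits
train_set = TensorDataset(torch.randn(1024, 16), torch.randint(0, 10, (1024,)))
valid_set = TensorDataset(torch.randn(256, 16), torch.randint(0, 10, (256,)))

# the train set gets a DistributedSampler, so each process sees its own shard
train_sampler = DistributedSampler(train_set)
train_loader = DataLoader(train_set, batch_size=64,
                          shuffle=False,          # the sampler handles shuffling
                          sampler=train_sampler)

# the validation set gets a plain DataLoader with no sampler,
# so every process iterates over the full set
valid_loader = DataLoader(valid_set, batch_size=64, shuffle=False)
```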

My final point of confusion: while running DistributedDataParallel I observe the GPU utilization (the Volatile GPU-Util column in nvidia-smi) going up and down like the stock market, anywhere from 0% to 99%. Can anyone shed some light on this?