Hi folks,
I’ve seen in the example here that for DataParallel one has to follow these steps:
- Create the model
- Wrap the model with DataParallel
- Send the model to the device
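
Roughly what I mean, as a minimal sketch (toy nn.Linear model and device handling are mine, not from the example):

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2)            # 1. create the model (toy model for illustration)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # 2. wrap the model with DataParallel
model = model.to(device)            # 3. send to device
```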
This led to my first confusion when I tried DistributedDataParallel following the PyTorch ImageNet example, which actually has steps 2 and 3 in reverse order (send to device first, then wrap); without that order it throws an error about dense CUDA tensors.
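
In other words, the order I ended up with looks roughly like this (sketch only; it assumes the process group is already initialised elsewhere, and local_rank is a placeholder for this process’s GPU index):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# dist.init_process_group(backend="nccl", ...)  # assumed to be done elsewhere
local_rank = 0  # placeholder; normally parsed from args or the environment

model = nn.Linear(10, 2)                     # 1. create the model
torch.cuda.set_device(local_rank)
model = model.to(local_rank)                 # 2. send to device FIRST
model = DDP(model, device_ids=[local_rank])  # 3. THEN wrap with DistributedDataParallel
```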
My second confusion also comes from the PyTorch ImageNet example of DistributedDataParallel usage, which uses torch.utils.data.distributed.DistributedSampler to wrap the train set but not the valid and test sets. Why is that?
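
What I mean is a setup like the following (toy datasets and explicit num_replicas/rank are mine so the snippet stands alone; normally they are inferred from the initialised process group):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy datasets standing in for my real train/valid sets.
train_set = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
valid_set = TensorDataset(torch.randn(200, 10), torch.randint(0, 2, (200,)))

# Only the training set is wrapped in a DistributedSampler, so each process
# sees a distinct shard of the training data.
train_sampler = DistributedSampler(train_set, num_replicas=2, rank=0)
train_loader = DataLoader(train_set, batch_size=32,
                          shuffle=(train_sampler is None), sampler=train_sampler)

# The validation loader is an ordinary, non-distributed loader.
valid_loader = DataLoader(valid_set, batch_size=32, shuffle=False)
```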
My final confusion is that when running DistributedDataParallel I observe the GPU volatility going up and down like the stock market, all the way from 0% to 99%. Can anyone shed some light on this?