Distributed data parallel slower than data parallel?

I’ve come across something strange: in a simple setting, training vgg16 for 10 epochs is faster with data parallel than with distributed data parallel.


MWE:

import torch
from torch.optim import Adam
from torchvision.models import vgg16

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = vgg16(...)

if dist and torch.distributed.is_available():  # `dist` is a flag that turns distributed training on
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    sampler = torch.utils.data.distributed.DistributedSampler
else:
    sampler = torch.utils.data.SubsetRandomSampler

db = Datasets(workers=4, pin_memory=True, sampler=sampler)  # my dataset/dataloader wrapper
optimizer = Adam(model.parameters())

if torch.distributed.is_initialized():
    model.to(device)
    model = torch.nn.parallel.DistributedDataParallel(model)
elif torch.cuda.device_count() > 1:
    model = torch.nn.parallel.DataParallel(model)
    model.to(device)
else:
    model.to(device)

# training loop .....

Launching distributed training with

python -m torch.distributed.launch main.py

The docs identify 2 cases.

  1. Single-Process Multi-GPU
  2. Multi-Process Single-GPU

I believe that my code falls under 1.

But to achieve 2., which the docs say is faster, they describe the following changes:

torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')
model = DistributedDataParallel(model, device_ids=[i], output_device=i)
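
As far as I can tell, in case 2 the same script runs once per GPU, and each copy would do something like the sketch below (a guess on my part; the helper name and the hard-coded world size of 4 are mine, not the docs'):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def setup_single_gpu_process(model, i, world_size=4):
    # every process joins the same group; world_size = total number of processes
    dist.init_process_group(backend='nccl', world_size=world_size, init_method='env://')
    torch.cuda.set_device(i)   # pin this process to GPU i
    model = model.cuda(i)
    # device_ids/output_device tell DDP that this replica lives only on GPU i
    return DistributedDataParallel(model, device_ids=[i], output_device=i)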

My understanding is that I should change the following line in my code:
torch.distributed.init_process_group(backend='nccl', init_method='env://')
into
torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')
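
(Side note: with init_method='env://' my understanding is that the rendezvous information can come from environment variables that the launcher sets, so the world size may not need to be hard-coded at all; a rough sketch of what I mean, assuming the standard env:// variables:)

import torch.distributed as dist

# env:// reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment,
# which torch.distributed.launch sets for every process it spawns
dist.init_process_group(backend='nccl', init_method='env://')
print(dist.get_world_size(), dist.get_rank())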

So far so good.

But I’m lost on this line:
model = DistributedDataParallel(model, device_ids=[i], output_device=i)

In the docs it says that i corresponds to a particular GPU.
Suppose, for the sake of example, we have 2 GPUs and I want to run the above MWE (moved into a file main.py).

Usually I would run main.py using something like python -m torch.distributed.launch main.py --i=1.
My confusion is that i only specifies one of the 2 available GPUs, so how is that distributed training? Or should I specify --i=[0, 1]?

Any clarifications or pointers to mistakes or misunderstandings that I’ve made are highly appreciated.

Thanks.

I made a mistake above: when launching distributed training with torch.distributed.launch I should have specified --nproc_per_node.

So I did that, running torch.distributed.launch --nproc_per_node=2, where 2 is the number of processes, one per GPU in the system, as the launch help text recommends, but I got even worse results, not faster ones.
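
For completeness, the full command I used this time (with 2 GPUs in the machine):

python -m torch.distributed.launch --nproc_per_node=2 main.py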

Unless I’m missing something fundamental here, why is it said that distributed data parallel is faster than data parallel?

We recommend the usage of DDP with a single process per GPU.
The launch.py script has some example usages for it.
E.g. you could use --local_rank in your script to specify the GPU.
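
A minimal sketch of that, showing only the argument handling (the rest of the training script stays as it is):

import argparse
import torch

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank=<0 .. nproc_per_node-1> to each process it spawns
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)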

Thanks for the reply. Please correct me if I’m wrong, but distributed training is meant to be used for:

  1. Multi-node & multi-gpu training
  2. Single-node & multi-gpu training → my use case scenario

I’ve seen them, and following the recommendation shown below I launch my script for my scenario as explained above.
How to use this module:

  1. Single-Node multi-process distributed training

    python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
    YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
    arguments of your training script)

This is what I don’t understand: why do I need to specify a GPU for distributed training on a single node with multiple GPUs?

Obviously I want to use all GPUs, so do I still need --local_rank? I thought it was used to specify GPUs in a multi-node scenario where nodes might have different numbers of GPUs.

When launching launch.py with --nproc_per_node=2, where 2 is the number of GPUs, my script gets called with --local_rank=0; shouldn’t that be 2 instead, one value for each GPU?

Even if I parse the --local_rank argument in my script and set the device to local_rank, that would still use only one of the available GPUs. I need to use them all; that’s where the confusion in my understanding lies. How do we do that?
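
To write down my current (possibly wrong) understanding of how all GPUs end up being used: launch spawns nproc_per_node copies of main.py, each copy receives a different --local_rank (0 and 1 with 2 GPUs), pins itself to that GPU, and the DistributedSampler gives each copy a different shard of the data, so together the processes cover every GPU. A self-contained sketch of that (the dummy dataset and hyper-parameters are only illustrative):

# launched as: python -m torch.distributed.launch --nproc_per_node=2 main.py
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from torchvision.models import vgg16

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

# one process per GPU: each process pins itself to "its" GPU
dist.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(args.local_rank)

model = vgg16(num_classes=10).cuda(args.local_rank)
model = DistributedDataParallel(model, device_ids=[args.local_rank],
                                output_device=args.local_rank)
optimizer = torch.optim.Adam(model.parameters())

# dummy data just to make the sketch runnable
dataset = TensorDataset(torch.randn(256, 3, 224, 224), torch.randint(0, 10, (256,)))
sampler = DistributedSampler(dataset)        # each process sees a different shard
loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(10):
    sampler.set_epoch(epoch)                 # reshuffle shards every epoch
    for x, y in loader:
        x = x.cuda(args.local_rank, non_blocking=True)
        y = y.cuda(args.local_rank, non_blocking=True)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()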