Distributed data parallel slower than data parallel?

I’ve come across something strange: in a simple setting, training vgg16 for 10 epochs is faster with data parallel than with distributed data parallel.


MWE:

import torch
from torch.optim import Adam
from torchvision.models import vgg16

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = vgg16(...)

if dist and torch.distributed.is_available():  # `dist` is a flag that turns distributed training on
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    sampler = torch.utils.data.distributed.DistributedSampler
else:
    sampler = torch.utils.data.SubsetRandomSampler

db = Datasets(workers=4, pin_memory=True, sampler=sampler)  # my dataset/dataloader wrapper
optimizer = Adam(model.parameters())

if torch.distributed.is_initialized():
    model.to(device)
    model = torch.nn.parallel.DistributedDataParallel(model)
elif torch.cuda.device_count() > 1:
    model = torch.nn.parallel.DataParallel(model)
    model.to(device)
else:
    model.to(device)

# training loop .....

Launching distributed training with

python -m torch.distributed.launch main.py

The docs identify 2 cases.

  1. Single-Process Multi-GPU
  2. Multi-Process Single-GPU

I believe that my code falls under 1.

But to achieve 2., which the docs say is faster, they describe the following changes:

torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')
model = DistributedDataParallel(model, device_ids=[i], output_device=i)
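
As far as I can tell, in case 2 the same script runs once per GPU, and each copy would do something like the sketch below (a guess on my part; the helper name and the hard-coded world size of 4 are mine, not the docs'):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def setup_single_gpu_process(model, i, world_size=4):
    # every process joins the same group; world_size = total number of processes
    dist.init_process_group(backend='nccl', world_size=world_size, init_method='env://')
    torch.cuda.set_device(i)   # pin this process to GPU i
    model = model.cuda(i)
    # device_ids/output_device tell DDP that this replica lives only on GPU i
    return DistributedDataParallel(model, device_ids=[i], output_device=i)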

My understanding is that I should change the following line in my code:
torch.distributed.init_process_group(backend='nccl', init_method='env://')
into
torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')
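
(Side note: with init_method='env://' my understanding is that the rendezvous information can come from environment variables that the launcher sets, so the world size may not need to be hard-coded at all; a rough sketch of what I mean, assuming the standard env:// variables:)

import torch.distributed as dist

# env:// reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment,
# which torch.distributed.launch sets for every process it spawns
dist.init_process_group(backend='nccl', init_method='env://')
print(dist.get_world_size(), dist.get_rank())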

So far so good.

But I’m lost on this line:
model = DistributedDataParallel(model, device_ids=[i], output_device=i)

In the docs it says that i corresponds to a particular GPU.
Suppose, for the sake of example, we have 2 GPUs and I want to run the above MWE (moved into a file main.py).

Usually I would run main.py using something like python -m torch.distributed.launch main.py --i=1.
My confusion is that i only specifies one of the 2 available GPUs, so how is that distributed training? Or should I specify --i=[0, 1]?

Any clarifications or pointers to mistakes or misunderstandings that I’ve made are highly appreciated.

Thanks.

I made a mistake above: when launching distributed training with torch.distributed.launch I should have specified --nproc_per_node.

So I did that, running torch.distributed.launch --nproc_per_node=2, where 2 is the number of processes, one per GPU in the system, as the launch help text recommends, but I got even worse results, not faster ones.
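
For completeness, the full command I used this time (with 2 GPUs in the machine):

python -m torch.distributed.launch --nproc_per_node=2 main.py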

Unless I’m missing something fundamental here, why is it said that distributed data parallel is faster than data parallel?

We recommend the usage of DDP with a single process per GPU.
The launch.py script has some example usages for it.
E.g. you could use --local_rank in your script to specify the GPU.
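
A minimal sketch of that, showing only the argument handling (the rest of the training script stays as it is):

import argparse
import torch

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank=<0 .. nproc_per_node-1> to each process it spawns
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)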

Thanks for the reply. Please correct me if I’m wrong, but distributed training is meant to be used for:

  1. Multi-node & multi-gpu training
  2. Single-node & multi-gpu training → my use case scenario

I’ve seen them, and following the recommendation shown below I launch my script for my scenario as explained above.
How to use this module:

  1. Single-Node multi-process distributed training

    python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
    YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
    arguments of your training script)

This is what I don’t understand: why do I need to specify a GPU for distributed training on a single node with multiple GPUs?

Obviously I want to use all GPUs, so do I still need --local_rank? I thought it was used to specify GPUs in a multi-node scenario where nodes might have different numbers of GPUs.

When launching launch.py with --nproc_per_node=2, where 2 is the number of GPUs, my script gets called with --local_rank=0; shouldn’t that be 2 instead, one value for each GPU?

Even if I parse the --local_rank argument in my script and set the device to local_rank, that would still use only one of the available GPUs. I need to use them all; that’s where the confusion in my understanding lies. How do we do that?
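
To write down my current (possibly wrong) understanding of how all GPUs end up being used: launch spawns nproc_per_node copies of main.py, each copy receives a different --local_rank (0 and 1 with 2 GPUs), pins itself to that GPU, and the DistributedSampler gives each copy a different shard of the data, so together the processes cover every GPU. A self-contained sketch of that (the dummy dataset and hyper-parameters are only illustrative):

# launched as: python -m torch.distributed.launch --nproc_per_node=2 main.py
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from torchvision.models import vgg16

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

# one process per GPU: each process pins itself to "its" GPU
dist.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(args.local_rank)

model = vgg16(num_classes=10).cuda(args.local_rank)
model = DistributedDataParallel(model, device_ids=[args.local_rank],
                                output_device=args.local_rank)
optimizer = torch.optim.Adam(model.parameters())

# dummy data just to make the sketch runnable
dataset = TensorDataset(torch.randn(256, 3, 224, 224), torch.randint(0, 10, (256,)))
sampler = DistributedSampler(dataset)        # each process sees a different shard
loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(10):
    sampler.set_epoch(epoch)                 # reshuffle shards every epoch
    for x, y in loader:
        x = x.cuda(args.local_rank, non_blocking=True)
        y = y.cuda(args.local_rank, non_blocking=True)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()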