Distributed data parallel slower than data parallel?

Thanks for the reply. Please correct me if I'm wrong, but my understanding is that distributed training is meant for:

  1. Multi-node, multi-GPU training
  2. Single-node, multi-GPU training → my use case scenario

I have seen them, and following the recommendation shown below, I launch my script for the single-node multi-GPU scenario described above.
How to use this module:

  1. Single-Node multi-process distributed training

    python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
    YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
    arguments of your training script)
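
Concretely, for my single node with two GPUs I launch it like this (train.py is just a placeholder name for my actual training script):

    python -m torch.distributed.launch --nproc_per_node=2 train.py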

This is what I don't understand: why do I need to specify a GPU for distributed training on a single node with multiple GPUs?

Obviously I want to use all GPUs. Do I still need --local_rank? I thought it was only used to specify GPUs in the multi-node scenario, where nodes might have different numbers of GPUs.

When I launch launch.py with --nproc_per_node=2 (where 2 is the number of GPUs), it passes --local_rank=0. Shouldn't that be 2 instead, one value for each GPU?

Even if I parse the --local_rank argument in my script and set the device to that value, the script would still be using only one of the available GPUs. I need to use all of them, and that is where my confusion lies. How do we do that?
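
To make that last point concrete, here is a minimal sketch of what my script currently does (the nn.Linear model is just a placeholder for my real network); as far as I can tell, each process would only ever touch the single GPU given by its local rank:

    import argparse

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every process it spawns
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # bind this process to one GPU and join the default process group
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # placeholder model standing in for my real network
    model = nn.Linear(10, 10).cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)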