Hanging distributed data parallel on interactive example

Can anybody clarify why I keep getting a CUDA error when using torch.nn.parallel.DistributedDataParallel?

MWE (model and device are defined earlier in the script):

import torch

if torch.cuda.device_count() > 1:
    torch.distributed.init_process_group(backend='nccl')
    model = torch.nn.parallel.DistributedDataParallel(model).to(device)

I execute the script with:

python -m torch.distributed.launch main.py

Am I using it incorrectly or am I missing something else?

An MWE:

import os
import torch
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
torch.distributed.init_process_group(backend='nccl', rank=1, world_size=2)

Then everything hangs completely: the terminal is never released back to the user. Any ideas why?

You have to run DistributedDataParallel with as many processes as the value of world_size. If you launch only a single process with rank=1, init_process_group will block, waiting for the process with rank=0 to start.
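If you want to start those processes yourself rather than through the launcher, a minimal sketch with torch.multiprocessing.spawn could look like the following (the Linear model is just a placeholder):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # every process uses the same rendezvous address and its own rank
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)                   # one GPU per process
    model = torch.nn.Linear(10, 10).to(rank)      # placeholder model
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # ... training loop using ddp_model goes here ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 2                                # one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)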

Check out the launch utility for easy launching of multiple processes. You can use it like this:

python -m torch.distributed.launch --nproc_per_node=2 ./my_script.py
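For that to work, ./my_script.py has to pick up the rank information the launcher provides. A rough sketch, assuming the classic launcher that passes --local_rank to each copy of the script and sets the rendezvous environment variables (the Linear model is a placeholder):

import argparse
import torch
import torch.distributed as dist

# torch.distributed.launch starts nproc_per_node copies of this script, passes each one
# its --local_rank, and sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE in the environment
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')   # rank/world_size come from the env

model = torch.nn.Linear(10, 10).to(args.local_rank)              # placeholder model
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])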

Awesome, thanks so much for clarifying that!
If you don’t mind, I have a couple of follow-up questions:

Does the launch utility have to be used in combination with DistributedDataParallel, or can it be used on its own?

Does using DistributedDataParallel mean it has to be combined with torch.utils.data.distributed.DistributedSampler?

Thanks again!

For multi-GPU on a single node, the values break down like this:

--nnodes=1 <-- number of nodes (launcher argument)
--node_rank=0 <-- which node this launcher instance is running on (launcher argument)
world_size <-- total number of processes, i.e. nnodes * nproc_per_node, one per GPU
rank <-- the unique id of each process, from 0 to world_size - 1; the launcher assigns it for you
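So for a single node with, say, 2 GPUs, the launch command would be something like:

python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=2 main.py

As for the DistributedSampler question: DistributedDataParallel does not strictly require it, but without it every process iterates over the full dataset instead of its own shard. A sketch of how the two are usually combined (the TensorDataset is just a placeholder):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# assumes init_process_group has already been called in this process
dataset = TensorDataset(torch.randn(1000, 10))   # placeholder dataset
sampler = DistributedSampler(dataset)            # shards the dataset across ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)                     # so shuffling differs between epochs
    for (batch,) in loader:
        pass                                     # forward/backward on the DDP-wrapped model here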