Can anybody clarify why I keep getting a CUDA error when using torch.nn.parallel.DistributedDataParallel?
if torch.cuda.device_count() > 1:
    # the model should live on its device before DDP wraps it
    model = torch.nn.parallel.DistributedDataParallel(model.to(device))
I launch it with:

python -m torch.distributed.launch main.py
Am I using it incorrectly or am I missing something else?
In main.py I initialize the process group like this:

import os
import torch.distributed

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
torch.distributed.init_process_group(backend='nccl', rank=1, world_size=2)
Then the whole thing hangs completely: the terminal is never released back to the user. Any ideas why?
You have to use DistributedDataParallel with as many processes as the value of world_size. If you specify rank=1 on a single process, it will hang waiting for the process with rank=0 to start.
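For illustration, here is a minimal sketch of starting one process per rank yourself with torch.multiprocessing.spawn (the worker function and the tiny model are placeholders I made up, not from this thread):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # every rank must call init_process_group, otherwise the other ranks hang waiting
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)

    device = torch.device(f'cuda:{rank}')
    model = torch.nn.Linear(10, 10).to(device)  # stand-in for your model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 2  # one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)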
Check out the launch utility for easy launching of multiple processes. You can use it like this:
python -m torch.distributed.launch --nproc_per_node=2 ./my_script.py
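In case it helps, a minimal sketch of what my_script.py might look like (the launcher sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT in the environment and passes --local_rank to each process; the model here is just a placeholder):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

# rank and world_size are picked up from the environment variables
# that torch.distributed.launch sets for each spawned process
dist.init_process_group(backend='nccl', init_method='env://')

torch.cuda.set_device(args.local_rank)
model = torch.nn.Linear(10, 10).cuda(args.local_rank)  # stand-in for your model
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])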
Awesome, thanks so much for clarifying that!
If you don’t mind, could I ask for one additional clarification? Does the launch utility have to be used in combination with DistributedDataParallel, or can it be used on its own?
The correct call for multi-GPU, single-node is:

world_size = 1 <-- the number of nodes
rank = 0 <-- the index of the node running this process
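Putting those pieces together, a minimal sketch of that single-process, single-node setup (the model is a placeholder; note that on recent PyTorch releases the recommended pattern is one process per GPU, as in the launch example above):

import os
import torch
import torch.distributed as dist

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'

# one process for the whole node: world_size counts nodes, rank is this node's index
dist.init_process_group(backend='nccl', rank=0, world_size=1)

model = torch.nn.Linear(10, 10).to('cuda:0')  # stand-in for your model
# older PyTorch releases let a single DDP process drive all visible GPUs;
# with the model on a single device this also works on current releases
model = torch.nn.parallel.DistributedDataParallel(model)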