DistributedDataParallel nccl freezing and gloo out of memory

Hello everybody,
I need help setting up DistributedDataParallel.

I would like to run 8 processes in parallel on 8 Tesla V100s, but on only one machine.

seed = args.seed + dist.get_rank()
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
torch.cuda.set_device(dist.get_rank())

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

torch.distributed.init_process_group(backend='gloo') # I tried many variations

args.device="cuda:{}".format(args.local_rank)
model.to(args.device)

To launch it I am using:

python3 -m torch.distributed.launch --nproc_per_node=8 train.py (other args...)

When I run it with ‘nccl’ as the backend, it freezes inside torch.nn.parallel.DistributedDataParallel.

When I use ‘gloo’ instead, it claims I don't have enough memory:

    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
RuntimeError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 0; 15.78 GiB total capacity; 724.41 MiB already allocated; 191.25 MiB free; 794.00 MiB reserved in total by PyTorch)

This doesn't make any sense to me, because according to the error message itself I should have enough memory.
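For reference, here is a minimal debugging print (not part of my training script, just something to illustrate the check) that shows which GPU each rank ends up on and how much memory it has allocated; it assumes the process group is already initialized:

import torch
import torch.distributed as dist

# after init_process_group: report this rank's GPU and its memory usage
dev = torch.cuda.current_device()
print(f"rank {dist.get_rank()}: cuda:{dev}, "
      f"allocated={torch.cuda.memory_allocated(dev) / 2**20:.0f} MiB, "
      f"reserved={torch.cuda.memory_reserved(dev) / 2**20:.0f} MiB")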

Thank you very much :grinning:

I haven't found any useful tutorials online either. If you have a step-by-step one it would be very nice, thank you.

This is the error I get with NCCL:

  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1156, in _distributed_broadcast_coalesced
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378098133/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I am not hitting either of those errors when using NCCL or Gloo with torch==1.8.0; both are working fine for me. Here is the script I am using, based off of yours (with a dummy model):

import torch
import torch.distributed as dist
import torch.nn as nn
import numpy as np
import random
import argparse
import os

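# rendezvous address/port for init_process_group; torch.distributed.launch also sets these env vars by default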
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('--local_rank', metavar='N', type=int,
                    help='rank')
parser.add_argument('--seed', metavar='N', type=int,
                    help='seed')
args = parser.parse_args()
args.device="cuda:{}".format(args.local_rank)

model = nn.Linear(1, 1).to(args.device)

torch.distributed.init_process_group(backend='nccl')
seed = args.seed + dist.get_rank()
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
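# on a single machine launched with --nproc_per_node, the global rank equals the local rank,
# so this pins each process to a distinct GPU (with multiple nodes you would use args.local_rank here)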
torch.cuda.set_device(dist.get_rank())

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
model.to(args.device)
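To check that the wrapper actually works, I also run a single forward/backward step on dummy data; the optimizer and the random inputs below are just placeholders I added for this test:

# one dummy training step per rank to verify the forward/backward pass and gradient sync
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(4, 1, device=args.device)
y = torch.randn(4, 1, device=args.device)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
print(f"rank {dist.get_rank()} loss {loss.item():.4f}")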

I am launching it with:

python3 -m torch.distributed.launch --nproc_per_node=2 test.py --seed=42
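If the NCCL run keeps freezing on your side, it may help to relaunch with NCCL's debug logging turned on; NCCL_DEBUG is a standard NCCL environment variable and its output usually points at the failing step, for example:

NCCL_DEBUG=INFO python3 -m torch.distributed.launch --nproc_per_node=8 train.py (other args...)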

Can you provide more detail about the model you are using and your torch version?
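For the torch version, the output of something like this is enough:

python3 -c "import torch; print(torch.__version__, torch.version.cuda)"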
