When I run it with ‘nccl’ as the backend, it freezes in torch.nn.parallel.DistributedDataParallel.
When I use ‘gloo’ instead, it claims I don’t have enough memory:
RuntimeError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 0; 15.78 GiB total capacity; 724.41 MiB already allocated; 191.25 MiB free; 794.00 MiB reserved in total by PyTorch)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
This doesn’t make any sense to me, because according to the error message itself I should have enough memory.
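One way to read those numbers is sketched below. It assumes the "free" figure is the device-wide free memory reported by the CUDA driver (so it already accounts for memory held by other processes), while "reserved" counts only this process. Under that assumption, whatever is left after subtracting free and reserved from the total is held outside this process, for example by other ranks that also landed on GPU 0. The helper names `to_mib` and `field` are made up for illustration.

```python
import re

# Error text copied from the post above.
msg = (
    "CUDA out of memory. Tried to allocate 224.00 MiB (GPU 0; 15.78 GiB total capacity; "
    "724.41 MiB already allocated; 191.25 MiB free; 794.00 MiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a '<number> <unit>' pair from the message into MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def field(label):
    """Return the amount (in MiB) that precedes `label` in the message."""
    match = re.search(r"([\d.]+) (MiB|GiB) " + label, msg)
    return to_mib(match.group(1), match.group(2))


total = field("total capacity")
free = field("free")
reserved = field("reserved")

# Assumption: `free` is device-wide, `reserved` is this process only.
used_elsewhere = total - free - reserved
print(f"{used_elsewhere:.2f} MiB (~{used_elsewhere / 1024:.2f} GiB) held outside this process")
```

If that reading is right, roughly 14.8 GiB of GPU 0 is occupied by something other than the failing process, which would explain why a 224 MiB allocation fails even though this process has reserved under 1 GiB.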
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378098133/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1156, in _distributed_broadcast_coalesced
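The "invalid usage" message on its own does not say which call NCCL rejected. NCCL reads the NCCL_DEBUG environment variable, and setting it to INFO before the process group is created makes it log more detail. A minimal sketch (the helper name `enable_nccl_debug` is hypothetical):

```python
import os


def enable_nccl_debug(level="INFO"):
    """Ask NCCL for verbose logs; must run before the NCCL communicator is created."""
    # setdefault keeps any value that was already exported in the shell.
    os.environ.setdefault("NCCL_DEBUG", level)
    return os.environ["NCCL_DEBUG"]


# Call this at the top of the script, before torch.distributed.init_process_group(...).
enable_nccl_debug()
```

Exporting the variable in the shell before launching should have the same effect; doing it in the script just keeps the setting next to the code that needs it.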
I am not hitting either of those errors with NCCL or Gloo on torch 1.8.0; both work fine. Here is the script I am using, which is based on yours (with a dummy model):
import torch
import torch.distributed as dist
import torch.nn as nn
import numpy as np
import random
import argparse
import os
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('--local_rank', metavar='N', type=int,
                    help='rank')
parser.add_argument('--seed', metavar='N', type=int,
                    help='seed')
args = parser.parse_args()
args.device="cuda:{}".format(args.local_rank)
model = nn.Linear(1, 1).to(args.device)
torch.distributed.init_process_group(backend='nccl')
seed = args.seed + dist.get_rank()
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
torch.cuda.set_device(args.local_rank)  # GPU index on this node; matches dist.get_rank() only on a single node
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
model.to(args.device)
I am launching it with:
python3 -m torch.distributed.launch --nproc_per_node=2 test.py --seed=42
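For comparison with the failing script, here is a minimal sketch of the setup order that usually keeps each process on its own GPU: pick the device first, then initialize the process group, then move and wrap the model. It assumes (the thread does not confirm this) that the original problem comes from several ranks sharing GPU 0. The helper names `resolve_local_rank` and `wrap_ddp` are hypothetical; the torch calls are the same ones used in the script above.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Read the per-node rank from --local_rank or, failing that, LOCAL_RANK."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", type=int, default=None)
    # parse_known_args ignores script-specific flags such as --seed.
    known, _ = parser.parse_known_args(argv)
    if known.local_rank is not None:
        return known.local_rank
    return int(environ.get("LOCAL_RANK", 0))


def wrap_ddp(model, local_rank, backend="nccl"):
    """Pin this process to one GPU, join the process group, then wrap the model."""
    import torch  # imported lazily so resolve_local_rank works without torch
    import torch.distributed as dist

    torch.cuda.set_device(local_rank)          # select the GPU before any CUDA work
    dist.init_process_group(backend=backend)   # reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
    model = model.to(f"cuda:{local_rank}")
    return torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )
```

In a training script this would be used as `model = wrap_ddp(model, resolve_local_rank())`, launched the same way as above.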
Can you provide more detail about the model you are using and your torch version?
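If it helps, the version details can be gathered with a short script like the one below. This is only a sketch: the function name `describe_environment` is made up, and the NCCL lookup is wrapped in a try block because it may not be available on every build.

```python
import platform


def describe_environment():
    """Collect version details worth pasting into a reply."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda          # None on CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    try:
        info["nccl"] = torch.cuda.nccl.version()
    except Exception as exc:                         # e.g. a build without NCCL
        info["nccl"] = f"unavailable ({exc})"
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running `python -m torch.utils.collect_env` produces a fuller report of the same kind.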