torch.distributed.init_process_group: RuntimeError: proc_group_name != group_name

Hi PyTorch,

PyTorch: v0.3.1:latest

I am trying to use torch.distributed.init_process_group() with the gloo backend, and it raises an error: RuntimeError: proc_group_name != group_name. I'm confused by the error message. Could anyone please tell me what this error indicates?


    if args.distributed:
        dist.init_process_group(
            backend='gloo',
            init_method="file://xxxxx/{}".format(args.dist_file),
            world_size=args.world_size,
            group_name=args.dist_file,
        )

RuntimeError: proc_group_name != group_name at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THD/process_group/General.cpp:17

It seems that there were conflicts when several processes wrote to the shared file on HDFS. I resolved it by manually making each rank wait a different number of seconds.
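
For anyone hitting the same thing, here is a rough sketch of the staggered wait described above. It is only an illustration of the workaround, not a confirmed fix for the underlying cause. The helper names (stagger_delay, staggered_init), the step_seconds value, and the assumption that each process already knows a rank-like index before calling init are all hypothetical and not taken from the original script.

```python
import time


def stagger_delay(rank, step_seconds=2.0):
    """Return how many seconds a process with this rank waits before init.

    rank is assumed to be a non-negative index known before initialization
    (hypothetical; the original script does not show where it comes from).
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return rank * step_seconds


def staggered_init(rank, init_fn, step_seconds=2.0, sleep=time.sleep):
    """Sleep for a rank-dependent delay, then call init_fn and return its result.

    init_fn is any zero-argument callable, for example a lambda wrapping
    dist.init_process_group(...) with the arguments from the post above.
    sleep is injectable so the delay can be checked without really waiting.
    """
    sleep(stagger_delay(rank, step_seconds))
    return init_fn()


# Hypothetical usage with the call from the post (args.rank is assumed):
# staggered_init(
#     args.rank,
#     lambda: dist.init_process_group(
#         backend='gloo',
#         init_method="file://xxxxx/{}".format(args.dist_file),
#         world_size=args.world_size,
#         group_name=args.dist_file,
#     ),
# )
```

With this layout rank 0 starts immediately and every later rank starts step_seconds after the previous one, so the processes are less likely to touch the shared file at the same moment. The right step size depends on how slow the file system is, so treat 2.0 as a placeholder.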