I want to train ImageNet classification on distributed nodes, but I hit errors with the example code

I want to train ImageNet classification across multiple distributed nodes.
I found the example at https://github.com/pytorch/examples/blob/master/imagenet/main.py
but I failed to run it.

Does anyone have experience training ImageNet on distributed nodes?
Can anyone tell me how to execute it?

My try1:

$ CUDA_VISIBLE_DEVICES=0 python main.py /dataset/imagenet_classify/ --world-size 2 --dist-url tcp://127.0.0.1:29500 --dist-backend tcp
Traceback (most recent call last):
  File "main.py", line 315, in <module>
    main()
  File "main.py", line 70, in main
    world_size=args.world_size)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/distributed/__init__.py", line 46, in init_process_group
    group_name, rank)
RuntimeError: tcp:// method with non-multicast addresses requires manual rank assignment at /pytorch/torch/lib/THD/process_group/General.cpp:17

$ CUDA_VISIBLE_DEVICES=1 python main.py /dataset/imagenet_classify/ --world-size 2 --dist-url tcp://127.0.0.1:29500 --dist-backend tcp
Traceback (most recent call last):
  File "main.py", line 315, in <module>
    main()
  File "main.py", line 70, in main
    world_size=args.world_size)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/distributed/__init__.py", line 46, in init_process_group
    group_name, rank)
RuntimeError: tcp:// method with non-multicast addresses requires manual rank assignment at /pytorch/torch/lib/THD/process_group/General.cpp:17
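
The error seems to ask for manual rank assignment when the tcp:// init method is used with a non-multicast address. As a rough sketch (the port and rank values are just my assumptions for two processes on one machine), I believe it wants something like this, which is what led me to try2 below:

import torch.distributed as dist

# assumed: two processes on one machine, master at 127.0.0.1:29500
# the first process passes rank=0, the second passes rank=1
dist.init_process_group(backend='tcp',
                        init_method='tcp://127.0.0.1:29500',
                        rank=0,            # 1 in the second process
                        world_size=2)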

My try2:

I changed line 68 as shown below and ran it again. I got the error messages below.

if args.distributed:
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend=args.dist_backend, rank=args.rank,
                            world_size=args.world_size)

$ CUDA_VISIBLE_DEVICES=0 python main.py /dataset/imagenet_classify/ --world-size 2 --rank 0
=> creating model 'resnet18'
Traceback (most recent call last):
  File "main.py", line 323, in <module>
    main()
  File "main.py", line 96, in main
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 94, in __init__
    dist.broadcast(p, 0)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/distributed/__init__.py", line 191, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: Bad address

$ CUDA_VISIBLE_DEVICES=1 python main.py /dataset/imagenet_classify/ --world-size 2 --rank 1
=> creating model 'resnet18'
Traceback (most recent call last):
  File "main.py", line 323, in <module>
    main()
  File "main.py", line 96, in main
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 94, in __init__
    dist.broadcast(p, 0)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/distributed/__init__.py", line 191, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: Connection reset by peer

I think the problem is caused by mixing GPU tensors and CPU tensors.
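
If that is the cause, it should show up outside the ImageNet script too. Here is a minimal two-process sketch I would use to check it (the RANK environment variable and the port are my own choices, not part of main.py):

import os
import torch
import torch.distributed as dist

rank = int(os.environ['RANK'])   # assumed: export RANK=0 in one shell, RANK=1 in the other
dist.init_process_group(backend='tcp',
                        init_method='tcp://127.0.0.1:29500',
                        rank=rank, world_size=2)

t_cpu = torch.zeros(10)          # CPU tensor: broadcast should succeed
dist.broadcast(t_cpu, 0)

t_gpu = torch.zeros(10).cuda()   # CUDA tensor: presumably fails like the errors above
dist.broadcast(t_gpu, 0)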

I am encountering exactly the same problem. Has this been debugged yet?

OK, I figured out the answer. According to http://pytorch.org/docs/master/distributed.html, the TCP backend doesn't support GPU tensors, so you just need to take out model.cuda()
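
In case it helps anyone else, here is a sketch of how I read that fix around line 96 of main.py (showing only the resnet18 case, not the full arch handling; the init_process_group call is the same as in the earlier posts):

import torch
import torchvision.models as models

# ... dist.init_process_group(...) with rank and world_size as above ...

model = models.resnet18()
# model = model.cuda()   # removed: the tcp backend only handles CPU tensors
model = torch.nn.parallel.DistributedDataParallel(model)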