I want to train ImageNet classification with distributed multi-node training.
I found the example at https://github.com/pytorch/examples/blob/master/imagenet/main.py,
but I failed to run it.
Does anyone have experience training ImageNet on distributed nodes?
Can anyone tell me how to execute it?
My try 1:
$ CUDA_VISIBLE_DEVICES=0 python main.py /dataset/imagenet_classify/ --world-size 2 --dist-url tcp://127.0.0.1:29500 --dist-backend tcp
Traceback (most recent call last):
  File "main.py", line 315, in <module>
    main()
  File "main.py", line 70, in main
    world_size=args.world_size)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/distributed/__init__.py", line 46, in init_process_group
    group_name, rank)
RuntimeError: tcp:// method with non-multicast addresses requires manual rank assignment at /pytorch/torch/lib/THD/process_group/General.cpp:17
$ CUDA_VISIBLE_DEVICES=1 python main.py /dataset/imagenet_classify/ --world-size 2 --dist-url tcp://127.0.0.1:29500 --dist-backend tcp
Traceback (most recent call last):
  File "main.py", line 315, in <module>
    main()
  File "main.py", line 70, in main
    world_size=args.world_size)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/distributed/__init__.py", line 46, in init_process_group
    group_name, rank)
RuntimeError: tcp:// method with non-multicast addresses requires manual rank assignment at /pytorch/torch/lib/THD/process_group/General.cpp:17
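(For reference: the error above says that with a tcp:// init method on a non-multicast address, each process must be given its rank explicitly. If I understand it correctly, the launch would then look something like the following — same flags as my commands above, with `--rank` added per process; this is a sketch, I have not confirmed it resolves the issue:)

```shell
# Rank must be assigned manually for tcp:// with a non-multicast address.
CUDA_VISIBLE_DEVICES=0 python main.py /dataset/imagenet_classify/ \
    --world-size 2 --rank 0 --dist-url tcp://127.0.0.1:29500 --dist-backend tcp
# In a second shell (or on the second node):
CUDA_VISIBLE_DEVICES=1 python main.py /dataset/imagenet_classify/ \
    --world-size 2 --rank 1 --dist-url tcp://127.0.0.1:29500 --dist-backend tcp
```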
My try 2:
I changed line 68 of main.py as shown below and executed it, but I got an error message.
if args.distributed:
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend=args.dist_backend, rank=args.rank,
                            world_size=args.world_size)
$ CUDA_VISIBLE_DEVICES=0 python main.py /dataset/imagenet_classify/ --world-size 2 --rank 0
=> creating model 'resnet18'
Traceback (most recent call last):
  File "main.py", line 323, in <module>
    main()
  File "main.py", line 96, in main
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 94, in __init__
    dist.broadcast(p, 0)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/distributed/__init__.py", line 191, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: Bad address
$ CUDA_VISIBLE_DEVICES=1 python main.py /dataset/imagenet_classify/ --world-size 2 --rank 1
=> creating model 'resnet18'
Traceback (most recent call last):
  File "main.py", line 323, in <module>
    main()
  File "main.py", line 96, in main
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 94, in __init__
    dist.broadcast(p, 0)
  File "/home/andrew/ml/local/lib/python2.7/site-packages/torch/distributed/__init__.py", line 191, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: Connection reset by peer
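To sanity-check my setup outside of main.py, I put together this minimal two-process script: it initializes the process group the same way as my try 2 (env:// rendezvous via MASTER_ADDR/MASTER_PORT, explicit rank and world size) and broadcasts one tensor, which is what DistributedDataParallel's __init__ does where it crashes for me. This is just a sketch using the gloo backend on CPU; the hostnames, port, and world size are my own assumptions, not from the example.

```python
import os
import torch
import torch.distributed as dist
from multiprocessing import Process

def run(rank, world_size):
    # MASTER_ADDR/MASTER_PORT tell the default env:// rendezvous where rank 0 listens.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    # Each process passes its own rank explicitly, plus the total world size.
    dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)
    t = torch.zeros(1)
    if rank == 0:
        t += 1.0
    # Same collective that DistributedDataParallel runs over model parameters.
    dist.broadcast(t, src=0)
    assert t.item() == 1.0  # every rank should now hold the value from rank 0
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 2
    procs = [Process(target=run, args=(r, world_size)) for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

If this runs cleanly on one machine but main.py still fails, that would at least narrow the problem down to the example's flags rather than the install.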