Single Node Distributed Training

I have an 8-GPU machine and I have followed the script at https://github.com/pytorch/examples/tree/master/imagenet for training ResNet-50. But when I initialize
dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
the script freezes without any error. Is it possible to set up distributed training on a single node?
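Yes, single-node distributed training is supported. A hang at `init_process_group` usually means the process is blocking while waiting for the other ranks to join: each rank blocks until `world_size` processes have connected, so if `args.world_size` is larger than the number of processes you actually launched, the call never returns. Below is a minimal sketch of a single-node setup; the function name `init_single_node`, the port `29500`, and the choice of the `gloo` backend are illustrative assumptions, not part of the original script (with GPUs you would typically use `nccl` instead).

```python
import os
import torch
import torch.distributed as dist

def init_single_node(rank: int, world_size: int) -> None:
    # For a single node, every rank connects to the local machine.
    # MASTER_ADDR/MASTER_PORT are read by the "env://" init method.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")  # assumed free port
    # This call blocks until `world_size` processes have called it,
    # so world_size must match the number of processes you launch.
    dist.init_process_group(
        backend="gloo",          # use "nccl" for multi-GPU training
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )

if __name__ == "__main__":
    # Single-process sanity check: world_size=1 returns immediately.
    init_single_node(rank=0, world_size=1)
    t = torch.ones(1)
    dist.all_reduce(t)  # no-op with one rank, but exercises the group
    print(t.item())
    dist.destroy_process_group()
```

For 8 GPUs on one node you would launch 8 such processes (one per GPU), each with a distinct `rank` in `0..7` and `world_size=8`; the `torchrun`/`torch.distributed.launch` utilities do this for you and set the environment variables automatically.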