Distributed Training using NCCL

Hi,
I have been trying to get distributed training up and running for a model that uses a ResNet as the base feature extractor. Here is the relevant part of my setup code:

if is_distributed:
    print("Trying to resolve host names now.")
    host_ip = []
    host_rank = resource_json["hosts"].index(resource_json["current_host"])

    os.environ['MASTER_ADDR'] = dns_lookup(resource_json["hosts"][0])
    # print(os.environ['MASTER_ADDR'])
    os.environ['MASTER_PORT'] = MASTER_PORT
    os.environ['WORLD_SIZE'] = str(size)
    os.environ['RANK'] = str(host_rank)

    set_nccl_environment(resource_json["network_interface_name"])
    dist.init_process_group(init_method='', backend=args.backend)

However, I am getting this error:

RuntimeError: world_size was not set in config at path/to/work/torch/lib/THD/process_group/General.cpp:17
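
For comparison, here is a minimal sketch of the environment-variable based initialization that I believe init_process_group expects. The address, port, and sizes below are placeholder values; in my script they come from resource_json.

import os
import torch.distributed as dist

# Placeholder values for illustration only
os.environ['MASTER_ADDR'] = '10.0.0.1'   # address of the rank-0 host
os.environ['MASTER_PORT'] = '23456'
os.environ['WORLD_SIZE'] = '2'           # total number of processes
os.environ['RANK'] = '0'                 # rank of this process

# With init_method='env://', init_process_group reads MASTER_ADDR,
# MASTER_PORT, WORLD_SIZE and RANK from the environment
dist.init_process_group(backend='nccl', init_method='env://')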

Can someone help me out?