I have been trying to get distributed training up and running for a model that uses ResNet as the base feature extractor. This is the code that sets up the process group:
if is_distributed:
    print("Trying to resolve host names now.")
    host_ip =  host_rank = resource_json["hosts"].index(resource_json["current_host"])
    os.environ['MASTER_ADDR'] = dns_lookup(resource_json["hosts"])
    #print(os.environ['MASTER_ADDR'])
    os.environ['MASTER_PORT'] = MASTER_PORT
    os.environ['WORLD_SIZE'] = str(size)
    os.environ['RANK'] = str(host_rank)
    set_nccl_environment(resource_json["network_interface_name"])
    dist.init_process_group(init_method='', backend=args.backend)
However, I am getting this error when the process group is initialized:
RuntimeError: world_size was not set in config at path/to/work/torch/lib/THD/process_group/General.cpp:17
Can someone help me out?
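
For reference, here is a minimal sketch of the environment-variable setup that PyTorch's `env://` initialization reads (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK). It is not the code from the post: the helper name `build_dist_env`, the host names, and the port 29500 are hypothetical. It also assumes one process per host and that the first host in the list acts as master. The `init_process_group` call is left as a comment so the sketch runs without a cluster.

```python
import os


def build_dist_env(hosts, current_host, master_port=29500):
    """Build the four variables that env:// initialization reads.

    Assumptions (not from the original post): one process per host,
    and the first entry in `hosts` acts as the master.
    """
    if current_host not in hosts:
        raise ValueError(f"{current_host!r} is not in the host list")
    return {
        "MASTER_ADDR": hosts[0],
        # os.environ only accepts strings, so every value is converted.
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(len(hosts)),
        "RANK": str(hosts.index(current_host)),
    }


# Hypothetical two-host cluster, seen from the second host.
env = build_dist_env(["host-a", "host-b"], "host-b")
os.environ.update(env)

# With the variables exported, the process group would be created with
# an explicit env:// init method, for example:
#   torch.distributed.init_process_group(backend="gloo", init_method="env://")
```

The main difference from the snippet in the question is that the init method is spelled out as `env://` rather than left as an empty string. Whether that alone resolves the error depends on the PyTorch version in use.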