I am trying to use two gpus on my windows machine, but I keep getting raise RuntimeError("Distributed package doesn't have NCCL " "built in") RuntimeError: Distributed package doesn't have NCCL built in
I am still new to pytorch and couldnt really find a way of setting the backend to ‘gloo’. I followed this link by setting the following but still no luck.
As NLCC is not available on windows I had to tweak the ‘setup_devices’ method of ‘training_args.py’ and write:
torch.distributed.init_process_group(backend=“nccl”) → torch.distributed.init_process_group(backend=“gloo”)
along with the ‘distributed_concat’ in ‘trainer_pt_utils.py’:
dist.all_gather(output_tensors, tensor) → dist.all_gather(output_tensors, tensor if len(tensor.shape) > 0 else tensor[None])
I am actually not sure what to put on rank. This code doesnt give an error but the command prompt freezes. I am using two gpus. What would be my rank. if I input rank = [0,1]. it gives an error rank should an integer.