I am trying to use two GPUs on my Windows machine, but I keep getting:
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
I am still new to PyTorch and couldn't really find a way of setting the backend to 'gloo'. I followed this link and set the following, but still no luck.
As NCCL is not available on Windows, I had to tweak the setup_devices method of training_args.py and change:
torch.distributed.init_process_group(backend="nccl") → torch.distributed.init_process_group(backend="gloo")
along with distributed_concat in trainer_pt_utils.py:
dist.all_gather(output_tensors, tensor) → dist.all_gather(output_tensors, tensor if len(tensor.shape) > 0 else tensor[None])
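For later readers: instead of editing the installed library files, the same effect can be sketched by picking the backend at runtime. This is my own snippet, not Hugging Face code:

import torch.distributed as dist

# Use NCCL where it is built in, fall back to gloo (e.g. on Windows).
# Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are provided
# by the launcher (the default env:// rendezvous).
backend = "nccl" if dist.is_nccl_available() else "gloo"
dist.init_process_group(backend=backend)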
How do I set backend='gloo', and from where?
@Mo_Balut could you please show your code?
torch.distributed.init_process_group(backend="gloo") is the right way to use gloo.
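A quick self-contained check, just a sketch assuming a single process, that gloo initializes on your machine:

import os
import torch.distributed as dist

# One-process group purely to verify that the gloo backend initializes.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print(dist.get_backend())  # should print "gloo"
dist.destroy_process_group()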
Thanks for replying!
import os
import pytorch_lightning as pl

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
trainer = pl.Trainer.from_argparse_args(args, checkpoint_callback=checkpoint_callback)
model = T5Finetuner(hparams=args)  # other constructor args were cut off in my paste
trainer.fit(model)
I am actually not sure what to put for rank. This code doesn't give an error, but the command prompt freezes. I am using two GPUs, so what would my rank be? If I pass rank = [0, 1], it gives an error that rank should be an integer.
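For context on rank: it is not a list of GPU ids; each worker process gets its own integer rank (0 and 1 for two GPUs). A generic sketch, not from this thread, using torch.multiprocessing.spawn:

import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # spawn passes each process its own integer rank: 0, 1, ...
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    print(f"initialized rank {rank} of {world_size}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)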
Thanks for your fast response.
I didn't get where to add this line.
When you run your program from the command line, you can prepend it to python train.py:
pbelevich@pbelevich-mbp ~ % PL_TORCH_DISTRIBUTED_BACKEND=gloo python train.py
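Note that the VAR=value prefix above is Unix shell syntax; in Windows cmd.exe you would run set PL_TORCH_DISTRIBUTED_BACKEND=gloo on its own line first. Alternatively, and this is my suggestion rather than something from the thread, set it at the top of the training script before the Trainer starts:

import os

# Must be set before PyTorch Lightning initializes its distributed backend.
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"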
Thank you so much, it worked! For some reason training is slower than on one GPU, even though both GPUs are being used!