How to set backend to 'gloo' on Windows

Mo_Balut · September 15, 2022, 11:56am

I am trying to use two GPUs on my Windows machine, but I keep getting:
raise RuntimeError("Distributed package doesn't have NCCL " "built in") RuntimeError: Distributed package doesn't have NCCL built in
I am still new to PyTorch and couldn't find a way to set the backend to ‘gloo’. I followed this link and made the changes it describes, but still no luck.

As NCCL is not available on Windows, I had to tweak the ‘setup_devices’ method of ‘training_args.py’ and write:

torch.distributed.init_process_group(backend="nccl") → torch.distributed.init_process_group(backend="gloo")

along with the ‘distributed_concat’ in ‘trainer_pt_utils.py’:

dist.all_gather(output_tensors, tensor) → dist.all_gather(output_tensors, tensor if len(tensor.shape) > 0 else tensor[None])

How do I set backend=‘gloo’, and where should it be set?

pbelevich · September 16, 2022, 3:58pm

@Mo_Balut could you please show your code?

torch.distributed.init_process_group(backend="gloo") is the right way to use gloo.

Mo_Balut · September 16, 2022, 4:43pm

Thanks for replying!

    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    torch.distributed.init_process_group(backend='gloo',rank=0, world_size=2)
    trainer = pl.Trainer.from_argparse_args(args, checkpoint_callback=checkpoint_callback)

    model = T5Finetuner(hparams=args,
                        train_dataloader=train_dataloader,
                        val_dataloader=val_dataloader,
                        test_dataloader=test_dataloader,
                        orthography=args.orthography)

    trainer.fit(model)

I am actually not sure what to put for rank. This code doesn't give an error, but the command prompt freezes. I am using two GPUs, so what should my rank be? If I pass rank=[0,1], I get an error saying rank should be an integer.
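
For readers hitting the same freeze: rank is a single integer that identifies one process, and init_process_group blocks until world_size processes have joined. With rank=0 and world_size=2 in a single process, a likely explanation for the hang is that rank 0 is waiting for a rank 1 that never starts. The sketch below is not the poster's setup; it is a minimal illustration in plain PyTorch, assuming a build with gloo support and a free port 12355. The helper names rank_env, worker, and main are hypothetical.

```python
import os


def rank_env(rank, world_size, addr="localhost", port=12355):
    """Rendezvous settings for one process; all ranks share addr and port."""
    return {
        "MASTER_ADDR": addr,
        "MASTER_PORT": str(port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def worker(rank, world_size):
    """Runs once per process; mp.spawn passes the process index as `rank`."""
    import torch
    import torch.distributed as dist

    os.environ.update(rank_env(rank, world_size))
    # Blocks until all `world_size` processes have made this call.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Tiny collective to confirm the ranks can talk to each other.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t)  # default reduction is SUM
    print(f"rank {rank}: sum of ranks = {t.item()}")

    dist.destroy_process_group()


def main(world_size=2):
    import torch.multiprocessing as mp

    # Starts `world_size` processes and calls worker(i, world_size) in each.
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

On Windows, main() should be called under an if __name__ == "__main__": guard, because child processes are started by re-importing the script. When a framework such as PyTorch Lightning launches the processes itself, it assigns the ranks and this manual call is not needed.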

pbelevich · September 16, 2022, 5:16pm

If you use PyTorch Lightning you should do the following:

https://pytorch-lightning.readthedocs.io/en/1.4.0/advanced/multi_gpu.html#select-torch-distributed-backend

Mo_Balut · September 17, 2022, 9:30am

Thanks for your fast response.
I didn't understand where to add this line: PL_TORCH_DISTRIBUTED_BACKEND=gloo

pbelevich · September 19, 2022, 4:16pm

When you run your program from the command line, you can prepend it before python train.py:

pbelevich@pbelevich-mbp ~ % PL_TORCH_DISTRIBUTED_BACKEND=gloo python train.py
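
Note that the VAR=value prefix form above is POSIX shell syntax. In the Windows Command Prompt, the usual equivalent is to run set PL_TORCH_DISTRIBUTED_BACKEND=gloo on its own line and then python train.py in the same window. Another option is to set the variable at the top of the training script, as in the sketch below. It assumes the installed Lightning version honours this variable as described in the linked 1.4.0 docs; the Trainer arguments in the comments are illustrative, not taken from the thread.

```python
import os

# Set this before the Trainer is created so that Lightning sees it
# when it initialises the distributed process group.
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

# The rest of train.py follows unchanged, for example (assumed arguments):
# import pytorch_lightning as pl
# trainer = pl.Trainer(gpus=2, accelerator="ddp")
# trainer.fit(model)
```

Child processes started by the script inherit the variable, so it does not need to be repeated per GPU.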

mba · September 19, 2022, 6:02pm

Thank you so much! It worked! For some reason, training is slower than on one GPU, even though both GPUs are being used.

polyteddy · December 19, 2022, 11:41pm

I used this in my script to try starting a training session with mmdetection on Windows and received an error:

RuntimeError: trying to initialize the default process group twice!

Would you kindly advise?
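
For context, this error is raised when init_process_group is called a second time in the same process. One possible cause, not confirmed in this thread, is that the framework's own launcher already creates the process group, so an extra manual call collides with it. A minimal guard, assuming that is the situation, checks torch.distributed.is_initialized() first. The helper name init_once and its dist parameter are hypothetical; the parameter exists only so the function can be exercised without real processes.

```python
def init_once(backend="gloo", dist=None, **kwargs):
    """Create the default process group only if it does not exist yet.

    Returns True if this call created the group, False if one already existed.
    """
    if dist is None:
        # Normal use: fall back to the real torch.distributed module.
        import torch.distributed as dist

    if dist.is_initialized():
        # Someone else (e.g. the framework's launcher) already set it up.
        return False

    dist.init_process_group(backend=backend, **kwargs)
    return True
```

If the framework exposes its own setting for the backend, changing that setting is usually cleaner than adding a second initialisation call.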

Muhammad_Ahtesham_Ul · March 29, 2023, 5:32pm

Please! Can you help me run the Alpaca model on Amazon SageMaker?
I can pay you as well.