How can I gather tensors from all GPUs on one machine (Windows with Gloo backend)?

I use only one machine with multiple GPUs to train. In init_process_group(), I set world_size=1 and rank=0, and I am not sure whether that is correct for multi-GPU training on one node. GPU usage looks fine (about 100% on both GPUs), but when I try to gather the same tensor from the different GPUs, dist.gather and dist.all_gather don't work. I get the following error when running dist.gather:
ValueError: ProcessGroupGloo::gather: Incorrect output list size 2. Output list size should be 1, same as size of the process group.

The environment is Windows, torch 1.7.1, Gloo backend, CUDA 10.2.

Hi, if you're aiming to do distributed training across 2 GPUs, you want to set world_size to 2, and the corresponding ranks would be 0 and 1.
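
As a minimal sketch (not your exact script), a single-machine, two-GPU setup could launch one process per GPU like this. The file:// init path and the function names here are illustrative; on Windows with torch 1.7.1, the Gloo process group is typically initialized through a shared file:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Every process joins the same group with the shared world_size and its own rank.
    dist.init_process_group(
        backend="gloo",
        init_method="file:///C:/tmp/ddp_init_file",  # example path; any file in an existing folder works
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)  # bind this process to its own GPU
    # ... model setup, DDP wrapping, training loop, gather logic ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # one rank per GPU
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```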

The GPU utilization is at 100% across both GPUs most likely because each GPU is running its own copy of the work independently, with no distributed coordination between them.

Your call to all_gather is the right approach, and the error message confirms that world_size is set incorrectly: with world_size=1 the process group only sees one rank, so it expects an output list of size 1, not 2. Fixing the world_size issue should unblock your use case.
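
Once each rank belongs to a group of size 2, the gather expects one output slot per rank. A rough sketch of that step inside each process, assuming the group above is already initialized; `local_tensor` is a stand-in for whatever tensor you want to collect, and since Gloo's CUDA support only covers a few collectives, the gather here is done on CPU copies:

```python
rank = dist.get_rank()
world_size = dist.get_world_size()

local_tensor = torch.randn(4, device=f"cuda:{rank}")  # placeholder per-GPU tensor
cpu_tensor = local_tensor.cpu()                        # Gloo gathers CPU tensors reliably

# One output slot per rank in the group.
gathered = [torch.empty_like(cpu_tensor) for _ in range(world_size)]
dist.all_gather(gathered, cpu_tensor)  # gathered[i] now holds rank i's tensor
```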