I train on a single machine with multiple GPUs. In init_process_group(), I set world_size=1 and rank=0, but I am not sure whether that is correct for multi-GPU training on one node. GPU utilization looks fine (about 100% on both GPUs). However, when I try to gather the same tensor from the different GPUs, both dist.gather and dist.all_gather fail; dist.gather raises the error below:
```
ValueError: ProcessGroupGloo::gather: Incorrect output list size 2. Output list size should be 1, same as size of the process group.
```
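Here is a minimal sketch of how I initialize the group and call dist.gather, reduced to just the failing call (the init_method path is a placeholder, not my real path):

```python
import torch
import torch.distributed as dist

# A single process for the whole node, even though I train on 2 GPUs.
dist.init_process_group(
    backend="gloo",
    init_method="file:///tmp/dist_init",  # placeholder; I use a file store on Windows
    world_size=1,
    rank=0,
)

tensor = torch.ones(2)
# I expected one output slot per GPU, but the group contains only one
# process, so this raises the ValueError above.
gather_list = [torch.zeros(2) for _ in range(2)]
dist.gather(tensor, gather_list, dst=0)
```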
My environment is Windows, torch 1.7.1, the gloo backend, and CUDA 10.2.
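Should I instead be starting one process per GPU, with world_size equal to the number of GPUs? The sketch below is what I would try (again, the init_method is a placeholder, and I am not sure this is the intended pattern on Windows):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # One process per GPU; world_size is the number of GPUs on the node.
    dist.init_process_group(
        backend="gloo",
        init_method="file:///tmp/dist_init",  # placeholder shared-file store
        world_size=world_size,
        rank=rank,
    )
    tensor = torch.ones(2) * rank
    # With world_size processes, an output list of the same size should match.
    gather_list = [torch.zeros(2) for _ in range(world_size)]
    dist.all_gather(gather_list, tensor)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```

Is this the correct way to gather the same tensor across GPUs on a single node?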