Trying to train in a multi-GPU environment with the distributed library, but training runs on only a single GPU

The imports I used are as follows:

from distributed import apply_gradient_allreduce
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader
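For context, `DistributedSampler` is the piece that splits the dataset across ranks. A quick CPU-only sketch (the toy dataset and the explicit `num_replicas`/`rank` arguments are mine, just for illustration; in real training they are read from the initialized process group):

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))  # toy dataset with 8 samples

# Passing num_replicas/rank explicitly lets us simulate two ranks in
# one process; normally they come from torch.distributed.
shards = [
    set(DistributedSampler(dataset, num_replicas=2, rank=r, shuffle=False))
    for r in range(2)
]
# Each simulated rank sees a disjoint half of the indices, and the two
# halves together cover the whole dataset.
```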

My init_distributed function is the following:

def init_distributed(args, n_gpus, rank, group_name):
    assert torch.cuda.is_available(), "Distributed mode requires CUDA."
    print("Initializing distributed")
    # Set the CUDA device so everything is done on the right GPU.
    torch.cuda.set_device(rank % torch.cuda.device_count())
    # Initialize distributed communication
    torch.distributed.init_process_group(
        backend=args.dist_backend, init_method=args.dist_url,
        world_size=n_gpus, rank=rank, group_name=group_name)
    print("Done initializing distributed")

When I start training, two processes are launched but both run on a single GPU.
How can I fix this problem?


I'm not familiar with torch.distributed, but I'm pretty sure you also need to set your model up to use multiple GPUs; otherwise it will default to only one GPU.

See the tutorials here.
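To make that concrete: besides `init_process_group`, each process also has to wrap the model in `DistributedDataParallel`. A minimal sketch (the helper name and the gloo/CPU setup are my assumptions, chosen so it runs without GPUs; with CUDA you would use the `nccl` backend, move the model with `model.cuda(rank)`, and pass `device_ids=[rank]`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(rank, world_size, model):
    # Hypothetical helper: each launched process calls this with its own rank.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # With CUDA: model = model.cuda(rank), then DDP(model, device_ids=[rank]).
    return DDP(model)
```

Once wrapped, gradients are all-reduced across processes automatically during `backward()`.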

I think it is due to the following line:
torch.cuda.set_device(rank % torch.cuda.device_count())
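That modulo is worth unpacking: if the launcher restricts each process with `CUDA_VISIBLE_DEVICES` (or only one device is visible for any other reason), `torch.cuda.device_count()` returns 1, so `rank % 1` is 0 for every rank and both processes end up on GPU 0. A GPU-free sketch of the mapping (the helper name is made up):

```python
def gpu_for_rank(rank, visible_gpus):
    # Mirrors torch.cuda.set_device(rank % torch.cuda.device_count()).
    return rank % visible_gpus

# Only one GPU visible: both ranks collapse onto device 0.
print([gpu_for_rank(r, 1) for r in range(2)])  # -> [0, 0]
# Both GPUs visible: ranks spread across devices 0 and 1.
print([gpu_for_rank(r, 2) for r in range(2)])  # -> [0, 1]
```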

Check the following:

  1. Case: run the model on a specific GPU: https://github.com/pytorch/examples/blob/master/imagenet/main.py#L144
  2. Case: run on all available GPUs: https://github.com/pytorch/examples/blob/master/imagenet/main.py#L154