PyTorch multi-GPU training

I’m trying to train on 2 GPUs with PyTorch.

    import os
    import torch
    from apex.parallel import DistributedDataParallel as DDP

    # multi gpu
    os.environ["CUDA_VISIBLE_DEVICES"] = '0,1'

    torch.distributed.init_process_group(backend='nccl',
                                         init_method='env://')
    # model = nn.DataParallel(model, output_device=1)
    model = DDP(model, delay_allreduce=True)

I added the init_process_group and DDP parts, but:

  1. It does not run without errors.
  2. I don’t know what the backend and init_method arguments mean.

@Sangwon_Jake

Check out this doc and this tutorial.

The init_method argument tells the processes how to perform rendezvous when forming the process group. You can use either of the following two options (a minimal sketch follows the list):

  1. Set it with a string. E.g., for local training, you can specify something like tcp://localhost:23456 to tell all processes to rendezvous on that port.
  2. Set the MASTER_ADDR and MASTER_PORT environment variables and keep init_method='env://'.
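
For illustration, here is a minimal sketch of both options on a single machine with two GPUs. The port 23456, the worker function, and the spawn-based launch are just assumptions for the example, not requirements of your setup:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # Option 1: pass the rendezvous endpoint directly as a string.
        dist.init_process_group(backend='nccl',
                                init_method='tcp://localhost:23456',
                                rank=rank,
                                world_size=world_size)

        # Option 2 (use instead of the call above): rely on env vars.
        # os.environ['MASTER_ADDR'] = 'localhost'
        # os.environ['MASTER_PORT'] = '23456'
        # dist.init_process_group(backend='nccl', init_method='env://',
        #                         rank=rank, world_size=world_size)

        torch.cuda.set_device(rank)  # one GPU per process
        # ... build the model, wrap it in DDP, train ...
        dist.destroy_process_group()

    if __name__ == '__main__':
        world_size = 2  # number of GPUs / processes
        mp.spawn(worker, args=(world_size,), nprocs=world_size)

Either way, each process ends up knowing the same address/port to rendezvous on, its own rank, and the world size.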

BTW, for questions regarding distributed training, please add a “distributed” tag so that the corresponding team can monitor them closely.