Use of DistributedDataParallel

import argparse
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim

parser = argparse.ArgumentParser()
parser.add_argument('--dist_url', default="env://", type=str)
parser.add_argument('--rank', type=int)
parser.add_argument('--gpu_to_work_on', type=int)
# torch.distributed.launch passes --local_rank to every process it spawns
parser.add_argument('--local_rank', type=int, default=0)
params = parser.parse_args()

def example():
    from torch.nn.parallel import DistributedDataParallel as DDP

    # set up the process group and pick the GPU this process should use
    init_dist(params)

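    # build the model on this process's GPU and wrap it with DDP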
    model = nn.Linear(10, 10).cuda(params.gpu_to_work_on)
    ddp_model = DDP(model, device_ids=[params.gpu_to_work_on])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

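    # one forward/backward pass followed by an optimizer step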
    outputs = ddp_model(torch.randn(20, 10).cuda(params.gpu_to_work_on))
    labels = torch.randn(20, 10).cuda(params.gpu_to_work_on)

    loss_fn(outputs, labels).backward()
    optimizer.step()


def init_dist(params):
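    # RANK and WORLD_SIZE are set for each process by torch.distributed.launch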
    params.rank = int(os.environ["RANK"])
    params.world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(
        backend="nccl",
        init_method=params.dist_url,
        world_size=params.world_size,
        rank=params.rank,
    )
    params.gpu_to_work_on = params.rank % torch.cuda.device_count()
    print('rank:', params.rank)
    print('gpu_to_work_on:', params.gpu_to_work_on)
    print('n_gpus:', torch.cuda.device_count())
    torch.cuda.set_device(params.gpu_to_work_on)
    return

if __name__ == '__main__':
    example()

python -m torch.distributed.launch main.py
Am I using DistributedDataParallel correctly?

Hi oasjd7. It looks like you have done what's required to use DDP: you have initialized the process group and created a DDP model, and you are using torch.distributed.launch to run your example. Are you having problems with your example code?
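For reference, a single-machine launch usually tells the launcher how many processes to start, one per GPU. The value 4 below is only an assumption, so substitute your own GPU count:

python -m torch.distributed.launch --nproc_per_node=4 main.py

The launcher sets the RANK and WORLD_SIZE environment variables that init_dist reads, and it also passes --local_rank to each copy of the script.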


@gcramer23 Thanks for your reply. There was no error, but I wondered whether my code is actually working the way I intended.

I use a batch size of only 32 and I don't need to make the batch larger; I just want more speed. In this situation, can DistributedDataParallel help me speed things up? (I have 4 GPUs with 12,000 MiB each, and my code needs at least 40,000 MiB.)

Yes, DDP can improve training speed, but it needs to be configured correctly. You can benchmark your configurations to see what works best for your use case.
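As a rough sketch of one possible configuration (the toy dataset, the global batch size of 32 from your post, and an already-initialized process group are all assumptions), DDP is typically paired with a DistributedSampler so each process trains on its own shard of the data, and you can time a fixed number of iterations to compare against your single-GPU run:

import time

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# toy stand-in for your real dataset (assumption for illustration)
dataset = TensorDataset(torch.randn(10_000, 10), torch.randn(10_000, 10))

# every process sees a disjoint shard; keeping the global batch at 32
# means each process gets 32 // world_size samples per step
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32 // dist.get_world_size(), sampler=sampler)

def benchmark(ddp_model, loss_fn, optimizer, device, n_iters=50):
    ddp_model.train()
    sampler.set_epoch(0)            # reshuffle per epoch when training for real
    torch.cuda.synchronize(device)
    start = time.time()
    for i, (x, y) in enumerate(loader):
        if i == n_iters:
            break
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x.to(device)), y.to(device))
        loss.backward()             # DDP all-reduces gradients across processes here
        optimizer.step()
    torch.cuda.synchronize(device)
    if dist.get_rank() == 0:
        print(f'{n_iters} iterations took {time.time() - start:.2f}s')

Note that DDP keeps a full copy of the model on every GPU, so splitting the batch this way only reduces the per-GPU batch and activation memory, not the memory needed for the model itself.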

This paper provides helpful information on this topic.