Multiple Processes Per GPU?

I am training a model that does not make full use of the GPU's compute and memory. Training runs on two 2080 Ti GPUs using DistributedDataParallel.

How can we concurrently train 2 models per GPU (each using different parameters), so that we can more fully utilize the GPUs?

The following code currently trains only 1 model across 2 GPUs.

import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def train(gpu, args):
    # each spawned process must join the process group before wrapping with DDP
    dist.init_process_group("nccl", rank=gpu, world_size=2)

    model = ...  # model construction omitted
    model = nn.parallel.DistributedDataParallel(model.to(gpu), device_ids=[gpu])

    # training loop
    for epoch in range(num_epochs):
        ...

if __name__ == '__main__':
    mp.spawn(train, nprocs=2, args=(args,))

One possibility is to:

  1. use the new_group API in torch.distributed to create a separate process group for each of the two models, and
  2. create a separate DistributedDataParallel instance for each model, passing the corresponding process group object explicitly to the DistributedDataParallel constructor (via the process_group argument) instead of using the default group.

In this way, the allreduce operations of the two DistributedDataParallel instances will not collide.
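The two steps above can be sketched as follows. This is a minimal, hypothetical example (the `nn.Linear` model, rank layout, and addresses are placeholders, not from the original post): 4 processes in total, where ranks 0 and 1 train model A with one replica per GPU, and ranks 2 and 3 train model B. It falls back to the "gloo" backend on machines without two CUDA devices so the structure can be tried anywhere.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def train(rank, world_size, results=None):
    # hypothetical rendezvous settings; adjust for your cluster
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    use_cuda = torch.cuda.is_available() and torch.cuda.device_count() >= 2
    backend = "nccl" if use_cuda else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    # Every process must call new_group in the same order, even for
    # groups it does not belong to.
    group_a = dist.new_group(ranks=[0, 1])
    group_b = dist.new_group(ranks=[2, 3])
    group = group_a if rank < 2 else group_b

    torch.manual_seed(0)                 # identical init within a group
    model = nn.Linear(4, 2)              # placeholder model
    if use_cuda:
        gpu = rank % 2                   # ranks 0,2 -> GPU 0; ranks 1,3 -> GPU 1
        model = model.to(gpu)
        ddp = nn.parallel.DistributedDataParallel(
            model, device_ids=[gpu], process_group=group)
    else:
        ddp = nn.parallel.DistributedDataParallel(
            model, process_group=group)

    # One illustrative step: the gradient allreduce stays inside `group`,
    # so the two models never wait on each other's collectives.
    x = torch.randn(8, 4, generator=torch.Generator().manual_seed(rank))
    if use_cuda:
        x = x.to(gpu)
    ddp(x).sum().backward()

    if results is not None:              # optionally expose grads for inspection
        results[rank] = model.weight.grad.cpu().flatten().tolist()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(train, nprocs=4, args=(4,))
```

After backward, the weight gradients of ranks 0 and 1 are identical (averaged inside group A), those of ranks 2 and 3 are identical (averaged inside group B), and the two groups' gradients differ, which confirms the two allreduces run independently.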