Single-machine, single-GPU: distributed best practices

I’ve been reading the docs about PyTorch’s distributed features such as torch.distributed and DDP models. It seems like these features are geared towards multi-node and/or multi-GPU settings, so I’m wondering if there is a set of best practices for the simplest case: multiprocessing using one CPU and one GPU.

Should the distributed package be used at all? Are there overheads that one should worry about? The reason that this package appeals to me is that I need to do a reduce operation at the end of each RL episode, which is easily handled by dist.all_reduce. And DDP also makes model updates super clean. I know that I can purely use torch.multiprocessing, but I guess the code would be messier.
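The per-episode reduce mentioned above could be sketched roughly as follows. This is only an illustration, assuming the gloo backend and CPU tensors; the names `sum_across_workers`, `worker`, and `launch`, and the placeholder episode return, are hypothetical and not from any existing code.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def sum_across_workers(value: float) -> float:
    """Sum a Python scalar over every rank in the default process group."""
    t = torch.tensor([value], dtype=torch.float64)  # CPU tensor, supported by gloo
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # in-place: every rank ends up with the sum
    return t.item()


def worker(rank: int, world_size: int) -> None:
    # Hypothetical worker: one process per environment.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    episode_return = float(rank)  # placeholder for a real episode's return
    mean_return = sum_across_workers(episode_return) / world_size
    if rank == 0:
        print(f"mean return over {world_size} workers: {mean_return}")
    dist.destroy_process_group()


def launch(world_size: int = 4) -> None:
    # mp.spawn passes the process index as the first argument to `worker`.
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

To try it, call `launch()` from under an `if __name__ == "__main__":` guard, since the spawn start method re-imports the main module in each child.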

Could asynchronous DDP model training on a single GPU be sped up by having each subprocess use a different CUDA stream? My thinking is that this would allow multiple subprocesses to run CUDA kernels in parallel, rather than queuing them up.

I don’t think DDP supports the single-GPU use case, since it would require multiple NCCL process groups to communicate on the same device (I don’t know whether this is supported in Gloo or another backend), so using different streams might be the right approach for you.

Each process would already be a separate Python process and would not share any CUDA streams.

Parallel execution requires free compute resources. If one kernel uses all compute resources you won’t be able to execute anything in parallel.
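For reference, this is roughly what explicit stream usage looks like inside a single process. It is a minimal sketch, not a recommendation; the function name `matmul_on_two_streams` is hypothetical, and the CPU branch exists only so the snippet runs without a GPU.

```python
import torch


def matmul_on_two_streams(a: torch.Tensor, b: torch.Tensor):
    """Compute a @ a and b @ b, each on its own CUDA stream when both inputs are on a GPU."""
    if not (a.is_cuda and b.is_cuda):
        # CPU fallback so the sketch still runs without a GPU.
        return a @ a, b @ b

    current = torch.cuda.current_stream()
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    # The side streams must not start before the inputs are ready.
    s1.wait_stream(current)
    s2.wait_stream(current)
    with torch.cuda.stream(s1):
        out_a = a @ a
    with torch.cuda.stream(s2):
        out_b = b @ b
    # Make the original stream wait for both results before they are used.
    current.wait_stream(s1)
    current.wait_stream(s2)
    # Tell the caching allocator these tensors are also used on `current`.
    out_a.record_stream(current)
    out_b.record_stream(current)
    return out_a, out_b
```

Whether the two matmuls actually overlap depends on the point above: if the first kernel already occupies the whole GPU, the second one simply waits.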

DDP works for me with the Gloo backend and a single device:

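import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# --- in main process ---
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

device = torch.device('cuda:0')

# model object sent to each subprocess (its parameters must already be on `device`)
model = ...

# --- in each subprocess ---
# rank is this subprocess's index, num_envs is the total number of subprocesses
dist.init_process_group('gloo', rank=rank, world_size=num_envs)

# use the same device in each instance of DDP
model = DDP(model, device_ids=[device])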
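A slightly fuller sketch of one training step under this setup is below. The toy `nn.Linear` model, the random batch, and the function name `ddp_step` are all stand-ins for illustration; it assumes `dist.init_process_group('gloo', ...)` has already been called in the current process, and it falls back to a CPU module when the given device is not CUDA.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_step(device: torch.device) -> float:
    """Run one optimizer step on a DDP-wrapped toy model and return the loss.

    Assumes the default process group is already initialised in this process.
    """
    model = nn.Linear(4, 1).to(device)  # toy stand-in for the real model
    # device_ids only applies to CUDA modules; for CPU modules it must stay None.
    ddp_model = DDP(model, device_ids=[device] if device.type == "cuda" else None)
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(8, 4, device=device)  # random batch, for illustration only
    loss = ddp_model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()  # DDP averages gradients across ranks during backward
    opt.step()
    return loss.item()
```

Each subprocess would call something like `ddp_step(torch.device('cuda:0'))` after initialising its process group, so every rank shares the same GPU while Gloo handles the gradient synchronisation.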

I see! Didn’t realize that different processes use different streams.