Reusing a GPU by multiple distributed processes

I’m training a model that doesn’t use the full GPU, but I need to train on multiple GPUs. Is there any way to use torch.distributed to accomplish this?

To illustrate, here is an example. It runs on a node with 8 GPUs, but I would like to run 12 processes. I launch it with

torchrun --nproc_per_node=12 --standalone

where the file is as follows:

import os
from torchvision.models import resnet34
import torch.distributed as dist
import torch
import torch.nn as nn

def main2():
    global_rank = dist.get_rank()
    local_rank = global_rank % 8
    def init_weights(m):
        if isinstance(m, nn.Linear):
   * global_rank)

    a = resnet34().cuda()

    print(f"Local_rank: {local_rank}")

def sync_model(model):
    for p in model.parameters():
        dist.all_reduce(p.data)  # assumed body: the all_reduce call mentioned below

if __name__ == "__main__":
    local_rank = int(os.environ['LOCAL_RANK']) % 8
    dist.init_process_group('nccl', rank=local_rank, world_size=int(os.environ['WORLD_SIZE']))

The code fails when the TCPStore is created, with "RuntimeError: Address already in use".

If I replace the init_process_group call with a plain dist.init_process_group('nccl'), the all_reduce call fails with

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1634272068694/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

What should I do to fix this error? Can someone help me with this?

Hi, the distributed package assumes one process per GPU, so you’d want to use 8 processes in this case. Is there any reason you’re trying to use 12 processes for 8 GPUs? The underlying cause is that libraries like NCCL don’t work well when multiple processes share the same GPU, which can result in deadlocks or hangs.
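The "Address already in use" error in the question is most likely a separate problem from GPU sharing. Passing rank=LOCAL_RANK % 8 gives local ranks 8 to 11 the same ranks as local ranks 0 to 3, so two processes claim rank 0, and with the default env:// initialization under torchrun --standalone, rank 0 is the process that hosts the TCPStore. A minimal sketch of that collision in plain Python follows; the helper name explicit_ranks is made up for illustration, and the numbers 12 and 8 come from the question.

```python
from collections import Counter


def explicit_ranks(nproc, num_gpus):
    """Ranks produced by `rank = LOCAL_RANK % num_gpus` across nproc processes."""
    return [local_rank % num_gpus for local_rank in range(nproc)]


ranks = explicit_ranks(nproc=12, num_gpus=8)
# Ranks that more than one process would claim.
duplicated = sorted(r for r, n in Counter(ranks).items() if n > 1)

print(ranks)       # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3]
print(duplicated)  # [0, 1, 2, 3]
```

Leaving rank and world_size out of init_process_group lets the RANK and WORLD_SIZE environment variables set by torchrun supply unique ranks 0 to 11. The modulo is then only needed to choose a device, which is where the NCCL restriction described above comes in.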

I’m trying to benchmark some algorithms which need more than 8 workers, but I only have 8 GPUs available.
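If the goal is only to have more than 8 logical workers, one possible workaround is to keep the unique ranks that torchrun assigns, map each process to a GPU with a modulo, and use the gloo backend instead of NCCL. Gloo communicates through host memory and sockets, so as far as I know it is not subject to NCCL's problem with several ranks on one device. This is a sketch under assumptions, not a tested recipe: gloo supports all_reduce on CUDA tensors but is generally slower than NCCL, so timings will not be representative of an NCCL run. The helper name pick_device is hypothetical, and the parameter averaging is only a guess at what sync_model was meant to do.

```python
import os


def pick_device(local_rank, num_gpus):
    """Map a local rank onto a GPU index, wrapping around when oversubscribed."""
    return local_rank % num_gpus


def main():
    # Imported here so pick_device stays usable without torch installed.
    import torch
    import torch.distributed as dist
    from torchvision.models import resnet34

    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK. No explicit rank is
    # passed, so each of the 12 processes keeps a unique rank.
    dist.init_process_group("gloo")
    local_rank = int(os.environ["LOCAL_RANK"])
    # Assumes at least one visible GPU (device_count() > 0).
    gpu = pick_device(local_rank, torch.cuda.device_count())
    device = torch.device("cuda", gpu)

    model = resnet34().to(device)

    # Average parameters across workers (assumed intent of sync_model).
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data /= world_size

    print(f"rank {dist.get_rank()} -> {device}")
    dist.destroy_process_group()


# Only run the distributed part when launched by torchrun.
if "LOCAL_RANK" in os.environ:
    main()
```

Launched with the same torchrun --nproc_per_node=12 --standalone command, local ranks 8 to 11 would share GPUs 0 to 3 with local ranks 0 to 3, so those four devices need enough memory for two copies of the model.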