Reusing a GPU across multiple distributed processes

Hi,
I’m training a model that doesn’t use a GPU’s full capacity, and I’d like to run more training processes than I have GPUs, with several processes sharing each GPU. Is there any way to use torch.distributed to accomplish this?

To illustrate, I have the following example. It runs on a node with 8 GPUs, but I would like to run 12 processes. I launch it with

torchrun --nproc_per_node=12 --standalone gpu_reuse_test.py

where the file gpu_reuse_test.py is as follows:

import os
from torchvision.models import resnet34
import torch.distributed as dist
import torch
import torch.nn as nn

def main2():
    global_rank = dist.get_rank()
    # Fold the 12 global ranks onto the 8 physical GPUs.
    local_rank = global_rank % 8
    torch.cuda.set_device(local_rank)

    def init_weights(m):
        if isinstance(m, nn.Linear):
            torch.nn.init.xavier_uniform_(m.weight)
            m.bias.data.fill_(0.01 * global_rank)

    a = resnet34().cuda()
    a.apply(init_weights)
    sync_model(a)

    print(f"Local_rank: {local_rank}")


def sync_model(model):
    # Sum each parameter across all ranks (all_reduce defaults to SUM).
    for p in model.parameters():
        dist.all_reduce(p)

if __name__ == "__main__":
    # Note: rank is set to the local GPU index (0-7), not the global rank (0-11).
    local_rank = int(os.environ['LOCAL_RANK']) % 8
    dist.init_process_group('nccl', rank=local_rank, world_size=int(os.environ['WORLD_SIZE']))
    main2()

The code fails inside the TCPStore constructor with a RuntimeError: Address already in use error.

If I replace the init_process_group call with a plain dist.init_process_group('nccl'), the all_reduce call fails with

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1634272068694/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
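
For clarity, the entry point of gpu_reuse_test.py then just becomes:

if __name__ == "__main__":
    # torchrun already sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so init_process_group can read everything from the environment.
    dist.init_process_group('nccl')
    main2()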

What should I do to fix this error? Can someone help me with this?

Hi, the distributed package assumes one process per GPU, so you’d want to use 8 processes in this case. Is there any reason you’re trying to use 12 processes on 8 GPUs? The root cause is that libraries like NCCL don’t work well when multiple processes try to use the same GPU, which leads to deadlocks, hangs, etc.
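
For reference, here’s a minimal sketch of the conventional one-process-per-GPU setup; it’s launched the same way as your script but with --nproc_per_node=8, and the resnet34/all_reduce part just mirrors your example:

import os

import torch
import torch.distributed as dist
from torchvision.models import resnet34

def main():
    # torchrun exports RANK/WORLD_SIZE/LOCAL_RANK, so no explicit rank arguments are needed.
    dist.init_process_group('nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = resnet34().cuda()

    # With exactly one process per GPU, NCCL collectives behave as expected.
    for p in model.parameters():
        dist.all_reduce(p)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Since torchrun handles the rendezvous through environment variables, there’s no need to pass rank or world_size to init_process_group yourself.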

I’m trying to benchmark some algorithms that need more than 8 workers, but I only have 8 GPUs available to me.