Hi,
I’m training a model that doesn’t use the full capacity of a single GPU, and I’d like to run more training processes than there are GPUs, so that several processes share each GPU. Is there any way to use torch.distributed to accomplish this?
To illustrate, I have the following example. It runs on a node with 8 GPUs, but I would like to run 12 processes. I launch it with

torchrun --nproc_per_node=12 --standalone gpu_reuse_test.py

where the file gpu_reuse_test.py is as follows:
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torchvision.models import resnet34


def main2():
    global_rank = dist.get_rank()
    # Fold the 12 global ranks onto the node's 8 physical GPUs,
    # so some GPUs host two processes.
    local_rank = global_rank % 8
    torch.cuda.set_device(local_rank)

    def init_weights(m):
        if isinstance(m, nn.Linear):
            torch.nn.init.xavier_uniform_(m.weight)
            # Rank-dependent bias so the subsequent all_reduce has a visible effect.
            m.bias.data.fill_(0.01 * global_rank)

    a = resnet34().cuda()
    a.apply(init_weights)
    sync_model(a)
    print(f"Local_rank: {local_rank}")


def sync_model(model):
    # Sum each parameter across all ranks.
    for p in model.parameters():
        dist.all_reduce(p)


if __name__ == "__main__":
    local_rank = int(os.environ['LOCAL_RANK']) % 8
    dist.init_process_group('nccl', rank=local_rank,
                            world_size=int(os.environ['WORLD_SIZE']))
    main2()
The code fails during initialization with

RuntimeError: Address already in use

raised from the return TCPStore(...) line inside init_process_group.
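My guess is that the problem with this init is that rank=local_rank gives two processes the same rank (for example, global ranks 0 and 8 both claim rank 0 out of a world size of 12), so the rendezvous collides. Here is a minimal sketch of how I assume the init is meant to look, using the global rank that torchrun puts in the RANK environment variable (this is my assumption, not something I have confirmed):

import os

import torch
import torch.distributed as dist

# torchrun --standalone sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
global_rank = int(os.environ["RANK"])        # 0..11 with --nproc_per_node=12
world_size = int(os.environ["WORLD_SIZE"])   # 12

# Pin each process to a GPU; two ranks end up sharing some GPUs.
torch.cuda.set_device(global_rank % 8)

dist.init_process_group("nccl", rank=global_rank, world_size=world_size)

Even with this variant I am not sure NCCL supports two ranks on the same device, which is really the heart of my question.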
If I instead use a plain dist.init_process_group('nccl') (letting torchrun's environment variables supply the rank and world size), the all_reduce call fails with
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1634272068694/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
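If it helps with diagnosis, I can rerun with NCCL's debug logging enabled (NCCL_DEBUG is a documented NCCL environment variable) and post the full log:

NCCL_DEBUG=INFO torchrun --nproc_per_node=12 --standalone gpu_reuse_test.py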
What should I do to fix these errors? Can someone help me with this?