Distributed training with Docker containers by splitting GPU resources across two instances

Hello,
I have an 8-GPU server for training and use Docker to run my experiments. I wanted to use the first 4 GPUs with one container for setting 1 of the experiment and the last 4 GPUs with another container for a different setting of the experiment. But when I try to do that I get a ‘RuntimeError: Address already in use’ error. I tried giving a different port to each of the runs, but I still hit the same issue. Any pointers to help me resolve it?

Hi, if you are using different ports for each run it should work, so I’m not sure why you are hitting this error. Can you provide more information about the initialization? Are you using init_process_group, and what are you passing for initialization? Additionally, can you post the backtrace of the error?
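For reference, a minimal env:// style initialization with a distinct port per container would look something like the sketch below (single node, 4 processes per container; the concrete values are placeholders, not taken from your setup):

import os
import torch.distributed as dist

# Placeholder values: the second container would use a different MASTER_PORT,
# e.g. "5654", so the two rendezvous servers do not collide.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "5653"

# Each of the 4 processes in a container calls this with its own rank;
# a launcher normally sets rank/world_size for you.
dist.init_process_group(backend="nccl", rank=0, world_size=4)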

Hi,
I am already setting a different port for each run… below is the trace:


File "deepspeedtrain_pyDL.py", line 345, in __init__
    torch.distributed.init_process_group(backend="nccl", **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: Address already in use

I had initialised the environment variables for the port, rank, and local rank as below:

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "5653"  
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"

I’m not sure why I’m getting this error… Is there a GitHub example I can refer to for setting up the initialization?

From what I can tell, init_process_group is attempting to create a TCPStore instance internally, so you need to find out what is currently bound to that address. As a test, you can try to create the store manually; this should fail in the same way:

import torch.distributed as dist

# Create a single store instance on the same address/port
# (arguments: host_name, port, world_size, is_master)
server_store = dist.TCPStore("127.0.0.1", 5653, 1, True)
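If that also raises ‘Address already in use’, something in that network namespace is already listening on port 5653, for example a process left over from an earlier run. You can also check with a plain socket bind, independent of PyTorch (just a quick sketch):

import socket

# Try to bind to the same address/port; this raises
# OSError ("Address already in use") if the port is taken.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 5653))
s.close()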

Perhaps you can also put print statements right before the torch.distributed.init_process_group(backend="nccl", **kwargs) call at deepspeedtrain_pyDL.py line 345 to print out the environment variables MASTER_ADDR and MASTER_PORT, just to confirm what each container is actually using.
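Something like this, reading whatever is already set in the environment at that point:

import os

# Confirm which rendezvous address/port this process will actually use
print("MASTER_ADDR =", os.environ.get("MASTER_ADDR"))
print("MASTER_PORT =", os.environ.get("MASTER_PORT"))
print("RANK =", os.environ.get("RANK"), "LOCAL_RANK =", os.environ.get("LOCAL_RANK"))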

What version of PyTorch are you using? I believe this error message has been updated in newer versions to be more descriptive about where the port/address conflict is (try PyTorch 1.10+).