Multiprocessing failed with Torch.distributed.launch module

Hi, my situation is a little different from @leo-mao's. I want to train my model with torch.distributed on a single machine (node) that has multiple GPUs. Since each run uses only 2 GPUs, I want to run several models at the same time. The problem is that once one model is running with torch.distributed, the others fail with "RuntimeError: Address already in use". I set up initialization the way @teng-li described. Maybe I am missing something, such as a port setting? I am confused about it and would really appreciate any suggestions.


Got the same problem as @Lausanne. Any thoughts on this?

@memray Hi, I have solved my problem. The error happened because two programs were using the same port. My solution is to pass a random port on the command line.
For example, you can write your sh command as "python -m torch.distributed.launch --nproc_per_node=$NGPUS --master_port=$RANDOM". The random number just picks a port that is unlikely to be in use already. Hope this solves your problem.
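
To illustrate why a second job fails, here is a minimal sketch that uses only the Python standard library (no PyTorch). It assumes the master process of each job behaves like an ordinary TCP listener; bind_listener and port_collides are hypothetical helper names, not part of torch.distributed.

```python
import errno
import socket


def bind_listener(port):
    """Bind a TCP listener on 127.0.0.1:port (port 0 lets the OS choose)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def port_collides():
    """Return True if binding an already-used port fails with EADDRINUSE."""
    first = bind_listener(0)            # stands in for the first training job
    port = first.getsockname()[1]
    try:
        second = bind_listener(port)    # stands in for the second job
    except OSError as exc:
        return exc.errno == errno.EADDRINUSE
    else:
        second.close()
        return False
    finally:
        first.close()


if __name__ == "__main__":
    print("second bind collided:", port_collides())
```

Giving each job its own --master_port avoids this collision, which is what the $RANDOM trick above achieves most of the time.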


Hi, I have been working with the distributed.launch module recently and I have some questions.

  1. I think that with the launcher and DistributedDataParallel(model), you don’t need to average gradients manually.
  2. During training, does your GPU 0 use more memory than the other GPUs? I found that the processes for the other GPUs allocate extra memory on GPU 0, which is annoying.

@zeal Regarding 1, yes you don’t need to manually average gradients. Regarding 2, this is possible if you have some code that somehow uses GPU 0 at some time during execution of your program. This is not an invariant of using the distributed data parallel module.

It’s weird. It seems to be a GPU cache release problem in PyTorch. I added torch.cuda.empty_cache() somewhere in my code and now every GPU has the same memory usage. But the program runs much slower, since empty_cache() is called inside a for loop.
I still cannot figure out what’s wrong. In theory, if you use distributed training, should every GPU have the same memory usage? I know that with the DataParallel module, GPU 0 has higher memory consumption.
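
One possible cause of the extra memory on GPU 0 (an assumption, since the code in question was not posted) is that each process touches the default CUDA device before being pinned to its own GPU, for example through a bare .cuda() call or a checkpoint loaded without map_location. A minimal sketch of pinning each process, where device_for and pin_process_to_gpu are hypothetical helper names:

```python
def device_for(local_rank):
    """Name of the CUDA device this process should use exclusively."""
    return f"cuda:{local_rank}"


def pin_process_to_gpu(local_rank):
    """Make local_rank's GPU the default device for this process."""
    # Imported lazily so device_for() stays usable without PyTorch installed.
    import torch

    torch.cuda.set_device(local_rank)
    return torch.device(device_for(local_rank))
```

Calling pin_process_to_gpu(local_rank) at the top of each worker, and loading checkpoints with torch.load(path, map_location=device_for(local_rank)), should keep allocations off GPU 0 if this is indeed the cause.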

Thanks. Worked for me.


Worked for me! Thanks!! :smile:


Worked for me. Thanks.


I tried not setting rank and world_size, but then I get “ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set”

os.environ['MASTER_ADDR'] = ''
os.environ['MASTER_PORT'] = '2000901'

#dist.init_process_group(backend, rank=rank, world_size=size)

Do you have any idea where this comes from?

Could you please provide an example script to reproduce this, and the arguments that you’re passing in to DDP? thanks!

Hi, sorry for the late reply. I have solved the issue.


How did you solve your problem?

Sometimes it may cause Permission Denied. Just change $RANDOM to a different integer; that works for me!
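
A likely explanation for the Permission Denied error: in bash, $RANDOM yields a value from 0 to 32767, and on most Unix-like systems ports below 1024 are reserved for privileged processes. A minimal sketch that only draws from the unprivileged range, where random_unprivileged_port is a hypothetical helper name:

```python
import random


def random_unprivileged_port(low=1024, high=65535):
    """Pick a random port number that does not require root to bind."""
    return random.randint(low, high)


if __name__ == "__main__":
    # The printed value can be passed as --master_port.
    print(random_unprivileged_port())
```

This removes the permission problem but a collision with a port that is already in use is still possible, just unlikely.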


I got the same error after trying the same thing. How did you solve it after removing the ‘world_size’ and ‘rank’ parameters from dist.init_process_group()?

Why does your solution work? My code now complains with this:

$ python playground/multiprocessing_playground/
# gpus 2

Start running DDP with model parallel example on rank: 0.
current process: <SpawnProcess name='SpawnProcess-1' parent=1863 started>
pid: 1890

Start running DDP with model parallel example on rank: 1.
current process: <SpawnProcess name='SpawnProcess-2' parent=1863 started>
pid: 1892
Traceback (most recent call last):
  File "playground/multiprocessing_playground/", line 115, in <module>
  File "playground/multiprocessing_playground/", line 30, in main
    mp.spawn(train, nprocs=args.gpus, args=(args,))
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/", line 118, in join
    raise Exception(msg)

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/", line 19, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/playground/multiprocessing_playground/", line 64, in train
    dist.init_process_group(backend='nccl', init_method='env://')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/distributed/", line 423, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/distributed/", line 155, in _env_rendezvous_handler
    raise _env_error("RANK")
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

Self-contained code:

import os
from datetime import datetime
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist
# from apex.parallel import DistributedDataParallel as DDP
# from apex import amp

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N',
                        help='number of nodes (default: 1)')
    parser.add_argument('-g', '--gpus', default=1, type=int,
                        help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int,
                        help='ranking within the nodes')
    parser.add_argument('--epochs', default=2, type=int, metavar='N',
                        help='number of total epochs to run')
    args = parser.parse_args()
    args.gpus = torch.cuda.device_count()
    args.world_size = args.gpus * args.nodes
    os.environ['MASTER_ADDR'] = ''
    # os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '8888'
    mp.spawn(train, nprocs=args.gpus, args=(args,))

class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

def train(gpu, args):
    print(f"Start running DDP with model parallel example on rank: {gpu}.")
    print(f'current process: {mp.current_process()}')
    print(f'pid: {os.getpid()}')

    rank = args.nr * args.gpus + gpu
    # dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
    # dist.init_process_group(backend='nccl', init_method='env://')
    model = ConvNet()
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    # Wrap the model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    # Data loading code
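    train_dataset = torchvision.datasets.MNIST(root='./data',
                                               train=True,
                                               transform=transforms.ToTensor(),
                                               download=True)
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,
                                                                    num_replicas=args.world_size,
                                                                    rank=rank)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=batch_size,
                                               shuffle=False,
                                               num_workers=0,
                                               pin_memory=True,
                                               sampler=train_sampler)

    start = datetime.now()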
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
                                                                         loss.item()))
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))


if __name__ == '__main__':
    print(f'# gpus {torch.cuda.device_count()}')
    main()

Your suggestion worked for me, but the random port did not work sometimes. So I generated an available port number with python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()'


I think this might also be useful to find free ports (better than random):

def find_free_port():
    """Ask the OS for a currently free TCP port and return it as a string."""
    import socket
    from contextlib import closing

    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])

Then do (this is my guess):

init_method = f"tcp://localhost:{find_free_port()}"

@smth How do I read, within my Python script, the flags that I am passing to torchrun? I want to pick up the --nproc_per_node=32 value I pass there automatically, rather than having to make sure the two scripts match. (Note that I want to set the world size myself, e.g. I am running parallel CPU jobs and want to choose that value myself.)
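
As far as I know, torchrun does not forward its own flags to the script, but it exports environment variables to every worker, including RANK, LOCAL_RANK, WORLD_SIZE and LOCAL_WORLD_SIZE (the last one matches --nproc_per_node). A minimal sketch for reading them, where read_torchrun_env is a hypothetical helper name and the fallback values are my own assumption for runs started without torchrun:

```python
import os


def read_torchrun_env(environ=None):
    """Collect the worker layout that torchrun exports as environment variables."""
    env = os.environ if environ is None else environ
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        # Equals --nproc_per_node when launched through torchrun.
        "local_world_size": int(env.get("LOCAL_WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    print(read_torchrun_env())
```

With this, the script can use read_torchrun_env()["local_world_size"] instead of repeating the number passed on the command line.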

Hi, I am getting the same error, but when I removed the world_size and rank, I got this error.
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
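
A note on this error: with init_method='env://' and no explicit rank and world_size arguments, init_process_group reads RANK and WORLD_SIZE (along with MASTER_ADDR and MASTER_PORT) from the environment. The torch.distributed.launch and torchrun launchers export these for every worker, but processes started with mp.spawn do not get them automatically. A minimal sketch of setting them by hand in each spawned process, where set_rendezvous_env is a hypothetical helper name and the default address and port are placeholders I chose:

```python
import os


def set_rendezvous_env(rank, world_size, master_addr="127.0.0.1", master_port=29500):
    """Export the variables that the env:// rendezvous expects for this process."""
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)


# Hypothetical use inside a worker started by mp.spawn(train, ...):
#     set_rendezvous_env(rank, args.world_size)
#     dist.init_process_group(backend='nccl', init_method='env://')
```

The alternative is to keep passing rank=... and world_size=... to init_process_group directly, in which case only MASTER_ADDR and MASTER_PORT need to be in the environment.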