Multiprocessing failed with torch.distributed.launch module

leo-mao · December 26, 2018, 2:21am

While training the MNIST example in PyTorch, I hit this RuntimeError on the master node:

File "./torch-dist/mnist-dist.py", line 201, in <module>
    init_processes(args.rank, args.world_size, run, args.batch_size, backend=args.backend)
  File "./torch-dist/mnist-dist.py", line 196, in init_processes
    dist.init_process_group(backend=backend, world_size=world_size, rank=rank, init_method="env://")
  File "/home/dl/anaconda2/envs/torch-dist-py3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/dl/anaconda2/envs/torch-dist-py3.6/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, start_daemon)
RuntimeError: Address already in use

It seems that multiprocessing with the launch utility has a problem.
I ran the code by invoking the launch utility as the documentation suggests:

python -m torch.distributed.launch --nproc_per_node=2 --nnode=2 --node_rank=0 --master_addr='10.0.3.29' --master_port=9901 ./torch-dist/mnist-dist.py

leo-mao · December 26, 2018, 2:22am

Here is my code; I hope it helps.

import argparse
import time
import torch

import torch.nn as nn
import torch.nn.functional as F
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.utils.data
import torch.utils.data.distributed
import torch.optim as optim

import torch.distributed


from torchvision import datasets, transforms
from torch.autograd import Variable

parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=256, metavar='N',
                    help='input batch size for training (default: 64)')
parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                    help='input batch size for testing (default: 1000)')
parser.add_argument('--epochs', type=int, default=5, metavar='N', help='number of epochs to train (default: 10)')

parser.add_argument('--lr', type=float, default=0.01, metavar='LR', help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M', help='SGD momentum (default: 0.5)')
parser.add_argument('--no-cuda', action='store_true', default=False, help='disables CUDA training')
parser.add_argument('--seed', type=int, default=1, metavar='S', help='random seed (default: 1)')
parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                    help='how many batches to wait before logging training status')
parser.add_argument('--backend', type=str, default='nccl')
parser.add_argument('--rank', type=int, default=0)
parser.add_argument('--world-size', type=int, default=1)
parser.add_argument('--local_rank', type=int)
args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=0)


def average_gradients(model):
    """ Gradient averaging"""
    size = float(dist.get_world_size())

    for param in model.parameters():
        dist.all_reduce_multigpu(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= size


def summary_print(rank, loss, accuracy, average_epoch_time, tot_time):
    import logging
    size = float(dist.get_world_size())
    summaries = torch.tensor([loss, accuracy, average_epoch_time, tot_time], requires_grad=False, device='cuda')
    dist.reduce_multigpu(summaries, 0, op=dist.ReduceOp.SUM)
    if rank == 0:
        summaries /= size
        logging.critical('\n[Summary]System : Average epoch time(ex. 1.): {:.2f}s, Average total time : {:.2f}s '
                         'Average loss: {:.4f}\n, Average accuracy: {:.2f}%'
                         .format(summaries[2], summaries[3], summaries[0], summaries[1] * 100))


def train(model, optimizer, train_loader, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if args.world_size > 1:
            average_gradients(model)
        if batch_idx % args.log_interval == 0:
            print('Train Epoch {} - {} / {:3.0f} \tLoss  {:.6f}'.format(
                epoch, batch_idx, 1.0 * len(train_loader.dataset) / len(data), loss))


def test(test_loader, model):
    model.eval()
    test_loss = 0
    correct = 0
    for data, target in test_loader:
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        # Varibale(data, volatile=True)
        data, target = Variable(data, requires_grad=False), Variable(target)
        output = model(data)
        test_loss += F.nll_loss(output, target, reduction='sum')
        pred = output.data.max(1, keepdim=True)[1]
        correct += pred.eq(target.data.view_as(pred)).cpu().sum()

    test_loss /= len(test_loader.dataset)
    print('\nTest set : Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'
          .format(test_loss, correct, len(test_loader.dataset),
                  100. * correct / len(test_loader.dataset)))
    return test_loss, float(correct) / len(test_loader.dataset)


def config_print(rank, batch_size, world_size):
    print('----Torch Config----')
    print('rank : {}'.format(rank))
    print('mini batch-size : {}'.format(batch_size))
    print('world-size : {}'.format(world_size))
    print('backend : {}'.format(args.backend))
    print('--------------------')


def run(rank, batch_size, world_size):
    """ Distributed Synchronous SGD Example """
    config_print(rank, batch_size, world_size)

    train_dataset = datasets.MNIST('../MNIST_data/', train=True,
                                   transform=transforms.Compose([transforms.ToTensor(),
                                                                 transforms.Normalize((0.1307,), (0.3081,))]))

    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, num_replicas=world_size,
                                                                    rank=rank)

    kwargs = {'num_workers': args.world_size, 'pin_memory': True} if args.cuda else {}

    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(datasets.MNIST('../MNIST_data/', train=False,
                                                             transform=transforms.Compose(
                                                                 [transforms.ToTensor(),
                                                                  transforms.Normalize((0.1307,), (0.3081,))])),
                                              batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net()

    if args.cuda:
        torch.cuda.manual_seed(args.seed)
        torch.cuda.set_device(args.local_rank)
        device = torch.device('cuda', args.local_rank)
        model.cuda(device=device)
        model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank)
        cudnn.benchmark = True
    else:
        device = torch.device('cpu')

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    torch.manual_seed(args.seed)
    tot_time = 0
    first_epoch = 0

    for epoch in range(1, args.epochs + 1):
        train_sampler.set_epoch(epoch)
        start_cpu_secs = time.time()
        train(model, optimizer, train_loader, epoch)
        end_cpu_secs = time.time()
        # print('start_cpu_secs {}'.format())
        print("Epoch {} of took {:.3f}s".format(
            epoch, end_cpu_secs - start_cpu_secs))

        tot_time += end_cpu_secs - start_cpu_secs
        print('Current Total time : {:.3f}s'.format(tot_time))
        if epoch == 1:
            first_epoch = tot_time

    test_loss, accuracy = test(test_loader, model)

    if args.epochs > 1:
        average_epoch_time = float(tot_time - first_epoch) / (args.epochs - 1)
        print('Average epoch time(ex. 1.) : {:.3f}s'.format(average_epoch_time))
        print("Total time : {:.3f}s".format(tot_time))
        if args.world_size > 1:
            summary_print(rank, test_loss, accuracy, average_epoch_time, tot_time)


def init_processes(rank, world_size, fn, batch_size, backend='gloo'):
    import os
    os.environ['MASTER_ADDR'] = '10.0.3.29'
    os.environ['MASTER_PORT'] = '9901'
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
    os.environ['NCCL_DEBUG'] = 'INFO'
    os.environ['GLOO_SOCKET_IFNAME'] = 'enp0s31f6'
    dist.init_process_group(backend=backend, world_size=world_size, rank=rank, init_method="env://")
    fn(rank, batch_size, world_size)


if __name__ == '__main__':
    init_processes(args.rank, args.world_size, run, args.batch_size, backend=args.backend)
    torch.multiprocessing.set_start_method('spawn')
    # processes = []
    # for rank in range(1):
    #     p = Process(target=init_processes,
    #                 args=(rank, args.world_size, run, args.batch_size, args.backend))
    #     p.start()
    #     processes.append(p)
    #
    # for p in processes:
    #     p.join()

smth · December 28, 2018, 11:21pm

Check ps -elf | grep python and see whether any processes from previous runs are still alive. They may still be occupying that port.

leo-mao · December 29, 2018, 9:18am

Thanks for the reply. No background process was found, and the port was always available.

I have fixed a typo in my command, where --master_addr='10.0.3.29' --master_port=9901 had been typed as --master_addr='10.0.3.29' --master_port='10.0.3.29'.

However, the error remains, and at least one process launched and produced this output:

Traceback (most recent call last):
  File "mnist-dist.py", line 203, in <module>
    init_processes(args.rank, args.world_size, run, args.batch_size, backend=args.backend)
  File "mnist-dist.py", line 198, in init_processes
    dist.init_process_group(backend=backend, world_size=world_size, rank=rank, init_method="env://")
  File "/home/dl/anaconda2/envs/torch-dist-py3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/dl/anaconda2/envs/torch-dist-py3.6/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, start_daemon)
RuntimeError: Address already in use
dl-Server:10648:10648 [1] NCCL INFO NET : Using interface enp0s31f6:10.0.13.29<0>
dl-Server:10648:10648 [1] NCCL INFO NET/IB : Using interface enp0s31f6 for sideband communication
dl-Server:10648:10648 [1] NCCL INFO Using internal Network Socket
dl-Server:10648:10648 [1] NCCL INFO NET : Using interface enp0s31f6:10.0.13.29<0>
dl-Server:10648:10648 [1] NCCL INFO NET/Socket : 1 interfaces found
NCCL version 2.3.7+cuda9.0
dl-Server:10648:10648 [1] NCCL INFO rank 0 nranks 1
dl-Server:10648:10675 [1] NCCL INFO comm 0x7fd38c00d3f0 rank 0 nranks 1
dl-Server:10648:10675 [1] NCCL INFO CUDA Dev 1, IP Interfaces : enp0s31f6(PHB) 
dl-Server:10648:10675 [1] NCCL INFO Using 256 threads
dl-Server:10648:10675 [1] NCCL INFO Min Comp Cap 6
dl-Server:10648:10675 [1] NCCL INFO comm 0x7fd38c00d3f0 rank 0 nranks 1 - COMPLETE
Train Epoch 1 - 0 / 234         Loss  5.576234
Train Epoch 1 - 10 / 234        Loss  5.541703
Train Epoch 1 - 20 / 234        Loss  5.515630
Train Epoch 1 - 30 / 234        Loss  5.514045
Train Epoch 1 - 40 / 234        Loss  5.485974
Train Epoch 1 - 50 / 234        Loss  5.462833
Train Epoch 1 - 60 / 234        Loss  5.422739
Train Epoch 1 - 70 / 234        Loss  5.374931
Train Epoch 1 - 80 / 234        Loss  5.342307
Train Epoch 1 - 90 / 234        Loss  5.291063
Train Epoch 1 - 100 / 234       Loss  5.220443
Train Epoch 1 - 110 / 234       Loss  5.083968
Train Epoch 1 - 120 / 234       Loss  5.002171
Train Epoch 1 - 130 / 234       Loss  4.953607
Train Epoch 1 - 140 / 234       Loss  4.894170
Train Epoch 1 - 150 / 234       Loss  4.805832
Train Epoch 1 - 160 / 234       Loss  4.792961
Train Epoch 1 - 170 / 234       Loss  4.732522
Train Epoch 1 - 180 / 234       Loss  4.770869
Train Epoch 1 - 190 / 234       Loss  4.688779
Train Epoch 1 - 200 / 234       Loss  4.725927
Train Epoch 1 - 210 / 234       Loss  4.620460
Train Epoch 1 - 220 / 234       Loss  4.605740
Train Epoch 1 - 230 / 234       Loss  4.563363
Epoch 1 of took 5.263s
Current Total time : 5.263s
Train Epoch 2 - 0 / 234         Loss  4.555817
Train Epoch 2 - 10 / 234        Loss  4.603082
Train Epoch 2 - 20 / 234        Loss  4.618500
Train Epoch 2 - 30 / 234        Loss  4.520389
Train Epoch 2 - 40 / 234        Loss  4.531864
Train Epoch 2 - 50 / 234        Loss  4.467782
Train Epoch 2 - 60 / 234        Loss  4.447100
Train Epoch 2 - 70 / 234        Loss  4.424728
Train Epoch 2 - 80 / 234        Loss  4.433639
Train Epoch 2 - 90 / 234        Loss  4.372109
Train Epoch 2 - 100 / 234       Loss  4.435561
Train Epoch 2 - 110 / 234       Loss  4.351253
Train Epoch 2 - 120 / 234       Loss  4.306677
Train Epoch 2 - 130 / 234       Loss  4.343150
Train Epoch 2 - 140 / 234       Loss  4.243150
Train Epoch 2 - 150 / 234       Loss  4.347620
Train Epoch 2 - 160 / 234       Loss  4.217095
Train Epoch 2 - 170 / 234       Loss  4.255800
Train Epoch 2 - 180 / 234       Loss  4.282191
Train Epoch 2 - 190 / 234       Loss  4.249407
Train Epoch 2 - 200 / 234       Loss  4.209113
Train Epoch 2 - 210 / 234       Loss  4.194527
Train Epoch 2 - 220 / 234       Loss  4.220213
Train Epoch 2 - 230 / 234       Loss  4.201759
Epoch 2 of took 5.524s
Current Total time : 10.787s
Train Epoch 3 - 0 / 234         Loss  4.158279
Train Epoch 3 - 10 / 234        Loss  4.111032
Train Epoch 3 - 20 / 234        Loss  4.147989
Train Epoch 3 - 30 / 234        Loss  4.255434
Train Epoch 3 - 40 / 234        Loss  4.111946
Train Epoch 3 - 50 / 234        Loss  4.111733
Train Epoch 3 - 60 / 234        Loss  4.176547
Train Epoch 3 - 70 / 234        Loss  4.063233
Train Epoch 3 - 80 / 234        Loss  4.079793
Train Epoch 3 - 90 / 234        Loss  4.042555
Train Epoch 3 - 100 / 234       Loss  4.050662
Train Epoch 3 - 110 / 234       Loss  4.066662
Train Epoch 3 - 120 / 234       Loss  4.090621
Train Epoch 3 - 130 / 234       Loss  4.015823
Train Epoch 3 - 140 / 234       Loss  4.092526
Train Epoch 3 - 150 / 234       Loss  4.045942
Train Epoch 3 - 160 / 234       Loss  4.048071
Train Epoch 3 - 170 / 234       Loss  3.984233
Train Epoch 3 - 180 / 234       Loss  3.942847
Train Epoch 3 - 190 / 234       Loss  3.943717
Train Epoch 3 - 200 / 234       Loss  3.959996
Train Epoch 3 - 210 / 234       Loss  4.059554
Train Epoch 3 - 220 / 234       Loss  3.918130
Train Epoch 3 - 230 / 234       Loss  4.074725
Epoch 3 of took 5.308s
Current Total time : 16.095s
Train Epoch 4 - 0 / 234         Loss  3.944645
Train Epoch 4 - 10 / 234        Loss  3.923414
Train Epoch 4 - 20 / 234        Loss  3.944232
Train Epoch 4 - 30 / 234        Loss  3.978234
Train Epoch 4 - 40 / 234        Loss  3.950741
Train Epoch 4 - 50 / 234        Loss  3.913695
Train Epoch 4 - 60 / 234        Loss  3.907088
Train Epoch 4 - 70 / 234        Loss  4.026055
Train Epoch 4 - 80 / 234        Loss  3.854659
Train Epoch 4 - 90 / 234        Loss  3.954557
Train Epoch 4 - 100 / 234       Loss  3.880200
Train Epoch 4 - 110 / 234       Loss  3.911777
Train Epoch 4 - 120 / 234       Loss  3.866536
Train Epoch 4 - 130 / 234       Loss  3.957554
Train Epoch 4 - 140 / 234       Loss  3.930515
Train Epoch 4 - 150 / 234       Loss  3.950871
Train Epoch 4 - 160 / 234       Loss  3.845739
Train Epoch 4 - 170 / 234       Loss  3.905876
Train Epoch 4 - 180 / 234       Loss  3.884211
Train Epoch 4 - 190 / 234       Loss  4.034623
Train Epoch 4 - 200 / 234       Loss  3.863284
Train Epoch 4 - 210 / 234       Loss  3.899471
Train Epoch 4 - 220 / 234       Loss  3.837218
Train Epoch 4 - 230 / 234       Loss  3.862398
Epoch 4 of took 5.271s
Current Total time : 21.366s
Train Epoch 5 - 0 / 234         Loss  3.878444
Train Epoch 5 - 10 / 234        Loss  3.919256
Train Epoch 5 - 20 / 234        Loss  3.872842
Train Epoch 5 - 30 / 234        Loss  3.926296
Train Epoch 5 - 40 / 234        Loss  3.787506
Train Epoch 5 - 50 / 234        Loss  3.959824
Train Epoch 5 - 60 / 234        Loss  3.830777
Train Epoch 5 - 70 / 234        Loss  3.883856
Train Epoch 5 - 80 / 234        Loss  3.877614
Train Epoch 5 - 90 / 234        Loss  3.846863
Train Epoch 5 - 100 / 234       Loss  3.908530
Train Epoch 5 - 110 / 234       Loss  3.819784
Train Epoch 5 - 120 / 234       Loss  3.798816
Train Epoch 5 - 130 / 234       Loss  3.757388
Train Epoch 5 - 140 / 234       Loss  3.837136
Train Epoch 5 - 150 / 234       Loss  3.855000
Train Epoch 5 - 160 / 234       Loss  3.821057
Train Epoch 5 - 170 / 234       Loss  3.777124
Train Epoch 5 - 180 / 234       Loss  3.714392
Train Epoch 5 - 190 / 234       Loss  3.776406
Train Epoch 5 - 200 / 234       Loss  3.886733
Train Epoch 5 - 210 / 234       Loss  3.927509
Train Epoch 5 - 220 / 234       Loss  3.719052
Train Epoch 5 - 230 / 234       Loss  3.785564
Epoch 5 of took 5.216s
Current Total time : 26.582s

Test set : Average loss: 4.8523, Accuracy: 9453/10000 (94%)

Average epoch time(ex. 1.) : 5.330s
Total time : 26.582s

Would you like to have a look?

leo-mao · December 29, 2018, 9:25am

Maybe the way I was using the launch utility is to blame. Do you have any idea? Was I supposed to type the same command on the master node and on the other worker nodes?

Any suggestions would be welcome, since I’ve been stuck for too long.

smth · December 31, 2018, 1:27am

@leo-mao actually I have reproduced your issue, I am taking a look.

teng-li · January 3, 2019, 7:48pm

@leo-mao, you should not set world_size and rank in torch.distributed.init_process_group; they are set automatically by torch.distributed.launch.

So please change that call to dist.init_process_group(backend=backend, init_method="env://")

Also, you should not set the WORLD_SIZE and RANK environment variables in your code either, since they will be set by the launch utility.
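
A small sketch of why the hard-coded values collide, assuming the launcher derives each worker's global rank as node_rank * nproc_per_node + local_rank and the world size as nnodes * nproc_per_node. The helper name launcher_env is hypothetical and not part of PyTorch. With env://, the rank-0 process is the one that hosts the TCPStore on MASTER_PORT (the start_daemon argument in the traceback), so if every process passes the script's default rank=0, more than one process on the master node tries to bind the same port.

```python
def launcher_env(nnodes, nproc_per_node, node_rank, master_addr, master_port):
    """Return one dict per local worker with the values the launcher hands out.

    Hypothetical helper for illustration only; the real values are produced
    by torch.distributed.launch itself.
    """
    world_size = nnodes * nproc_per_node
    envs = []
    for local_rank in range(nproc_per_node):
        envs.append({
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            # Global rank is unique across all nodes.
            "RANK": str(node_rank * nproc_per_node + local_rank),
            # Handed to the script as the --local_rank argument.
            "local_rank": str(local_rank),
        })
    return envs


if __name__ == "__main__":
    # Values from the command in the first post: 2 nodes, 2 processes each.
    for env in launcher_env(2, 2, node_rank=0,
                            master_addr="10.0.3.29", master_port=9901):
        print(env["RANK"], env["WORLD_SIZE"], env["local_rank"])
```

Inside the training script, dist.get_rank() and dist.get_world_size() return these values once init_process_group has been called with init_method="env://" and no explicit rank or world_size.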

leo-mao · January 7, 2019, 7:07am

You are right. It works after I deleted the rank and world_size parameters from torch.distributed.init_process_group. Thanks a lot!

Lausanne · February 10, 2019, 2:21am

Hi, my situation is a little different from @leo-mao's. I want to train my model with torch.distributed on one machine (node) that has multiple GPUs. Since each run uses only 2 GPUs, I want to run several models at the same time. The problem is that once one model is running with torch.distributed, the others fail with "RuntimeError: Address already in use". I initialize the process group the way @teng-li describes. Maybe I am missing something, such as a port setting? I would really appreciate any suggestions.

memray · February 14, 2019, 4:43am

I have the same problem as @Lausanne. Any thoughts on this?

Lausanne · February 26, 2019, 3:28am

@memray Hi, I have solved my problem. The error happened because two programs were using the same port. My solution is to pass a random port on the command line.
For example, you can write your sh command as " python -m torch.distributed.launch --nproc_per_node=$NGPUS --master_port=$RANDOM train.py ", so that each run gets a random port number. I hope this solves your problem.
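
One caveat: in bash, $RANDOM yields a value between 0 and 32767, so it can land on a privileged port (below 1024) or on a port that is already taken. A possible alternative, sketched here on the assumption that the probe runs on the same machine as the launcher, is to ask the operating system for a free port by binding to port 0. The helper name find_free_port is hypothetical. There is still a small race, because another process could claim the port between the probe and the launch.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the OS to pick any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(find_free_port())
```

If this were saved as find_free_port.py (a hypothetical file name), the command could use --master_port=$(python find_free_port.py) in place of --master_port=$RANDOM.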

zeal · March 19, 2019, 3:09am

Hi, I have been working with the distributed.launch module recently, and I have some questions.

1. I think that with the launch utility and DistributedDataParallel(model), you don’t need to average gradients manually.
2. During your training, does GPU 0 use more memory than the other GPUs? I found that the processes for the other GPUs take up extra memory on GPU 0, which is annoying.

pietern · March 19, 2019, 4:04pm

@zeal Regarding 1, yes you don’t need to manually average gradients. Regarding 2, this is possible if you have some code that somehow uses GPU 0 at some time during execution of your program. This is not an invariant of using the distributed data parallel module.

zeal · March 22, 2019, 3:18am

It’s weird. It seems to be a GPU cache release problem in PyTorch. I added torch.cuda.empty_cache() somewhere in my code, and now every GPU has the same memory usage. But the program runs much slower, since empty_cache() is called inside a for loop.
I still cannot figure out what is wrong. In theory, if you use distributed training, should every GPU have the same memory usage? I know that if you use the DataParallel module, GPU 0 will have more memory consumption.

cognitiverobot · September 5, 2019, 12:07am

Thanks. worked for me.

Cyril-JZ · November 25, 2019, 12:22pm

Worked for me! Thanks!!

ginobilinie · December 16, 2019, 6:25pm

Worked for me. Thanks.

Liangqiong_Qu · February 11, 2020, 9:25pm

I tried not setting rank and world_size, but then I get “ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set”

os.environ['MASTER_ADDR'] = '171.65.34.137'
os.environ['MASTER_PORT'] = '2000901'

#dist.init_process_group(backend, rank=rank, world_size=size)
dist.init_process_group(backend)

Do you have any idea where this comes from?
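
The thread does not record how this was resolved, so the following is only a hedged sketch. With init_method env:// and no explicit rank or world_size arguments, MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE all have to be present in the environment. The launch utility normally provides them, while starting the script directly with plain python leaves RANK unset. Separately, '2000901' is outside the valid TCP port range of 1 to 65535. The helper name check_env_rendezvous is hypothetical and not a PyTorch API.

```python
import os

REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def check_env_rendezvous(env=None):
    """Return a list of problems that would likely break env:// initialization."""
    env = os.environ if env is None else env
    problems = ["{} is not set".format(name)
                for name in REQUIRED if name not in env]
    port = env.get("MASTER_PORT")
    if port is not None:
        # TCP ports are 16-bit, so anything outside 1-65535 cannot be bound.
        if not port.isdigit() or not 1 <= int(port) <= 65535:
            problems.append(
                "MASTER_PORT={!r} is not a valid TCP port (1-65535)".format(port))
    return problems


if __name__ == "__main__":
    # Environment as set up by the snippet in the post above.
    posted_env = {"MASTER_ADDR": "171.65.34.137", "MASTER_PORT": "2000901"}
    for problem in check_env_rendezvous(posted_env):
        print(problem)
```

Calling check_env_rendezvous() with no argument inspects the real process environment, which may help confirm whether the script was actually started through the launcher.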

rvarm1 · February 12, 2020, 10:07pm

Could you please provide an example script to reproduce this, and the arguments that you’re passing in to DDP? Thanks!

Liangqiong_Qu · February 15, 2020, 11:54pm

Hi, sorry for the late reply. I have solved the issue.