The right way to distribute training over multiple GPUs and nodes

Hi guys :sweat_smile: I have read a lot in the PyTorch documentation about how to distribute training over multiple GPUs, and I am confused. I also had to read about multiprocessing in general, because I didn’t have any prior knowledge of it :sweat_smile: I don’t know whether my code is correct, but every time I run it I see these errors:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525812548180/work/aten/src/THC/THCTensorRandom.cu line=25 error=46 : all CUDA-capable devices are busy or unavailable
and

RuntimeError: cuda runtime error (46) : all CUDA-capable devices are busy or unavailable at /opt/conda/conda-bld/pytorch_1525812548180/work/aten/src/THC/THCTensorRandom.cu:25

Here is my code (Net is my MNIST model, defined elsewhere in the script):
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
import torch.optim as optim
from torch.multiprocessing import Process
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

def run(rank, size):
    device = torch.device("cuda")
    dataset = datasets.MNIST('./data', train=True,
                             transform=transforms.Compose([
                                 transforms.ToTensor(),
                                 transforms.Normalize((0.1307,), (0.3081,))]))
    # Give each process a disjoint shard of the dataset.
    train_sampler = DistributedSampler(dataset)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=128,
                                               num_workers=4,
                                               pin_memory=True,
                                               sampler=train_sampler)
    torch.manual_seed(1234)

    model = Net().cuda()
    model = torch.nn.parallel.DistributedDataParallel(model)
    optimizer = optim.SGD(model.parameters(),
                          lr=0.01, momentum=0.5)

    num_batches = len(train_loader)
    for epoch in range(10):
        epoch_loss = 0.0
        # Reshuffle each rank's shard differently every epoch.
        train_sampler.set_epoch(epoch)
        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            epoch_loss += loss.item()
            loss.backward()
            optimizer.step()
        print('Rank ', dist.get_rank(), ', epoch ',
              epoch, ': ', epoch_loss / num_batches)

def init_processes(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 4
    processes = []
    for rank in range(size):
        p = Process(target=init_processes, args=(rank, size, run))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()

I also read the distributed ImageNet example to try to do the same thing.
Is this the correct way to use distributed training?

PS: The first error appears three times and the second appears four times (I only copied the last line of the second error); both appear at the same time.
I also tried googling the error, but nothing I found helped me fix it.

Hi, it’s usually simpler to start the several Python processes using PyTorch’s torch.distributed.launch utility. Here is a (very) simple introduction to distributed training in PyTorch; there are several ways you can improve on it, but it will show you an example in action.
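To make that concrete, here is a minimal sketch of the launch pattern (the script name minimal_ddp.py and the model class MyNet are placeholders; it assumes one process per GPU on a single node, which is what torch.distributed.launch sets up):

# minimal_ddp.py -- start with:
#   python -m torch.distributed.launch --nproc_per_node=4 minimal_ddp.py
# The launcher spawns one process per GPU, passes each a distinct
# --local_rank argument, and exports MASTER_ADDR, MASTER_PORT, RANK
# and WORLD_SIZE into the environment.
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# Pin this process to its own GPU before doing any CUDA work;
# otherwise every process ends up on GPU 0.
torch.cuda.set_device(args.local_rank)

# 'env://' reads the rendezvous info the launcher put in the environment.
dist.init_process_group(backend='nccl', init_method='env://')

model = MyNet().cuda()  # MyNet: placeholder for your model class
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank)
# ...build the DistributedSampler / DataLoader and train as usual...

Incidentally, one common cause of error 46 ("all CUDA-capable devices are busy or unavailable") is every worker process trying to use the same default GPU, for example when the devices are in exclusive compute mode, so pinning each process to its own device as above is worth trying.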
