dataloader_kwargs = {'pin_memory': True} if use_cuda else {}
dcount = torch.cuda.device_count()
devices = []
model = Net()
for i in range(dcount):
    devices.append(torch.device("cuda:" + str(i)))

torch.manual_seed(args.seed)
mp.set_start_method('spawn')

# model = Net().to(device)
for i in range(dcount):
    model.to(devices[i])
model.share_memory()  # gradients are allocated lazily, so they are not shared here

processes = []
for rank in range(args.num_processes):
    p = mp.Process(target=train, args=(rank, args, model, devices[int(rank % dcount)], dataloader_kwargs))
    # We first train the model across `num_processes` processes
    p.start()
    processes.append(p)
for p in processes:
    p.join()
However, when I run this code with num_processes = 2 (since there are two GPUs in my machine), I can see only one of them engaged. Can you please suggest what exactly I need to change in the code here?
This snippet will first move the model to device 0 and then to device 1, so after the loop it lives only on device 1. If you don't explicitly move the model to the right device inside the function you're running through multiprocessing, you'll have to make the device dependent on the rank of the target process. As is, I assume you're only using GPU 1.
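
A minimal sketch of that rank-dependent device selection (not the poster's actual script): each worker derives its device from its own rank and moves its own replica there, so with two processes and two GPUs both devices are engaged. The Net class and the training loop body below are placeholders, and the sketch drops the share_memory() / Hogwild aspect of the original snippet, since a single shared parameter set cannot live on two GPUs at once.

import torch
import torch.multiprocessing as mp
import torch.nn as nn


class Net(nn.Module):  # placeholder standing in for the poster's model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)


def train(rank, num_devices):
    # Pick the device from the worker's rank: with two GPUs,
    # rank 0 uses cuda:0 and rank 1 uses cuda:1.
    if num_devices > 0:
        device = torch.device("cuda:" + str(rank % num_devices))
    else:
        device = torch.device("cpu")  # fallback if no GPU is available
    model = Net().to(device)  # per-process replica on that worker's device
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(10):  # dummy loop standing in for the real training code
        x = torch.randn(32, 10, device=device)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print("rank {} trained on {}".format(rank, device))


if __name__ == "__main__":
    mp.set_start_method('spawn')
    num_devices = torch.cuda.device_count()
    processes = []
    for rank in range(2):  # num_processes = 2, as in the question
        p = mp.Process(target=train, args=(rank, num_devices))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

If you do want the shared-memory behavior of the original example, you'd instead keep the shared model where it was placed before spawning and only move data (or a per-device copy you synchronize yourself) inside each worker; the key point either way is that the device choice has to depend on the rank.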