PyTorch: How to parallelize over multiple GPUs using multiprocessing.Pool

I have the following code which I am trying to parallelize over multiple GPUs in PyTorch:

import numpy as np
import torch
from torch.multiprocessing import Pool

X = np.array([[1, 3, 2, 3], [2, 3, 5, 6], [1, 2, 3, 4]])
X = torch.DoubleTensor(X).cuda()

def X_power_func(j):
    X_power = X**j
    return X_power

if __name__ == '__main__':
  with Pool(processes = 2) as p:   # Parallelizing over 2 GPUs
    results = p.map(X_power_func, range(4))

results

But when I run the code, I get this error:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "<ipython-input-35-6529ab6dac60>", line 11, in X_power_func
    X_power = X**j
RuntimeError: CUDA error: initialization error
"""

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-35-6529ab6dac60> in <module>()
     14 if __name__ == '__main__':
     15   with Pool(processes = 1) as p:
---> 16     results = p.map(X_power_func, range(8))
     17 
     18 results

1 frames
/usr/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645 
    646     def _set(self, i, obj):

RuntimeError: CUDA error: initialization error

Where have I gone wrong? Any help would really be appreciated.

By default, calling .cuda() copies your tensor to device cuda:0. I do not see anywhere that you have specified device ids for the multiple GPUs. Besides, referring to results outside the scope of the __main__ block will also raise an error. The CUDA initialization error will go away if you call mp.set_start_method('spawn', force=True) before spawning the process pool; however, that alone would still not give you a correct implementation of what you are trying to do.
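
For instance, here is a minimal sketch along those lines, assuming two visible GPUs (the worker function and the round-robin device assignment are only illustrative):

import torch
import torch.multiprocessing as mp
from torch.multiprocessing import Pool

NUM_GPUS = 2   # assumption: two visible GPUs

def worker(device_id, x_cpu, j):
    # Each worker moves its own copy of the data to its assigned GPU,
    # so no CUDA context is inherited from the parent process.
    x = x_cpu.cuda(device_id)
    return (x ** j).cpu()   # bring the result back to the CPU before returning

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)   # must be set before CUDA is touched in workers
    X_cpu = torch.DoubleTensor([[1, 3, 2, 3], [2, 3, 5, 6], [1, 2, 3, 4]])  # keep the input on the CPU
    args = [(j % NUM_GPUS, X_cpu, j) for j in range(4)]   # round-robin over the GPUs
    with Pool(processes=NUM_GPUS) as p:
        results = p.starmap(worker, args)
    print(results)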


Many thanks @bapi

I added mp.set_start_method('spawn', force=True) into the code below. Would this be right?

import numpy as np
import torch
import torch.multiprocessing as mp
from torch.multiprocessing import Pool

X = np.array([[1, 3, 2, 3], [2, 3, 5, 6], [1, 2, 3, 4]])
X = torch.DoubleTensor(X).cuda()

def X_power_func(j):
    X_power = X**j
    return X_power

if __name__ == '__main__':
  mp.set_start_method('spawn', force=True)
  with Pool(processes = 1) as p:   # Parallelizing over 2 GPUs
    results = p.map(X_power_func, range(2))

results

Also, how do I specify the device ids for multiple GPUs for my code?

Sorry if I have too many questions.

Any news? Have you solved the problem? How? I think the heart of @bapi's answer is that you have to manually transfer each input array to its GPU (a fraction of it or the whole thing, depending on your problem).

I solved like this:

import time
import torch
from torch.multiprocessing import Pool
torch.multiprocessing.set_start_method('spawn', force=True)


def use_gpu(ind, arr):
    # 'ind' is the device index from mysenddata; 'arr' is already on that GPU,
    # so this computation runs on the worker's assigned device.
    return (arr.std() + arr.mean() / (1 + arr.abs())).sum()


def mysenddata(mydata):
    # Split the batch: slice ii goes to GPU ii (assumes 4 visible GPUs).
    return [(ii, mydata[ii].cuda(ii)) for ii in range(4)]


if __name__ == "__main__":
    print('create big tensor')
    aa = 10*torch.randn(4,10000,10000).double()
    print('send data')
    b = mysenddata(aa)

    for ii in range(10):
        a = time.time()
        print('start')
        # Use a single context-managed pool per iteration; the extra
        # Pool(processes=4) that was created here before was never used.
        with Pool(processes=4) as p:
            results = p.starmap(use_gpu, b)
        print('end')
        print("cost time :", time.time() - a)
        
        for ii, (rr, bb) in enumerate(zip(results, b)):
            print('idx:{}, inshape:{}, indevice:{}, intype:{}, outshape:{}, outdevice:{}, outtype:{}'.format(ii, bb[1].shape, bb[1].get_device(), bb[1].type(), rr.shape, rr.get_device(), rr.type()))
            

This code seems OK for general GPU processing, but it will not work if the backward method has to be called. Does anyone have a simple tutorial on multi-GPU processing?
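
If gradients are needed, the usual route is PyTorch's built-in data parallelism rather than a process pool; below is a minimal sketch with torch.nn.DataParallel, assuming at least two visible GPUs and a throwaway linear model:

import torch
import torch.nn as nn
import torch.nn.functional as F

if __name__ == '__main__':
    model = nn.DataParallel(nn.Linear(10, 1)).cuda()   # replicates the module on all visible GPUs
    x = torch.randn(64, 10).cuda()   # the batch is split across the replicas in forward()
    y = torch.randn(64, 1).cuda()
    loss = F.mse_loss(model(x), y)
    loss.backward()                  # gradients are gathered back onto the default GPU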


Hi @heavyfranz. I am afraid I haven’t found a solution for this problem yet, so your solution above helps!

When you say “backward” method, do you mean backpropagation?