I have the following code which I am trying to parallelize over multiple GPUs in PyTorch:
import numpy as np
import torch
from torch.multiprocessing import Pool

X = np.array([[1, 3, 2, 3], [2, 3, 5, 6], [1, 2, 3, 4]])
X = torch.DoubleTensor(X).cuda()

def X_power_func(j):
    X_power = X**j
    return X_power

if __name__ == '__main__':
    with Pool(processes = 2) as p:   # Parallelizing over 2 GPUs
        results = p.map(X_power_func, range(4))

results
But when I run the code, I get this error:
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "<ipython-input-35-6529ab6dac60>", line 11, in X_power_func
X_power = X**j
RuntimeError: CUDA error: initialization error
"""
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-35-6529ab6dac60> in <module>()
14 if __name__ == '__main__':
15 with Pool(processes = 1) as p:
---> 16 results = p.map(X_power_func, range(8))
17
18 results
1 frames
/usr/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
642 return self._value
643 else:
--> 644 raise self._value
645
646 def _set(self, i, obj):
RuntimeError: CUDA error: initialization error
Where have I gone wrong? Any help would really be appreciated.
By default, .cuda() copies your tensor to device cuda:0, and I do not see anywhere that you have specified device ids for multiple GPUs. Besides, referencing results outside the scope of __main__ will raise an error. The CUDA initialization error will go away if you call mp.set_start_method('spawn', force=True) before spawning the process pool; however, that still won't give you a correct implementation of what you are trying to do.
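To make that concrete, here is a minimal sketch (not a definitive fix) of the original example with those two changes applied: the start method is set to 'spawn', and each worker builds its tensor on an explicit device. The j % 2 device assignment is just an illustrative way of spreading the jobs over two GPUs.

import torch
import torch.multiprocessing as mp

def X_power_func(j):
    # Each worker picks a GPU and creates its own tensor there;
    # CUDA tensors created in the parent cannot be reused after fork.
    device = torch.device('cuda:{}'.format(j % 2))  # spread jobs over 2 GPUs
    X = torch.tensor([[1, 3, 2, 3], [2, 3, 5, 6], [1, 2, 3, 4]],
                     dtype=torch.float64, device=device)
    return (X ** j).cpu()  # move back to CPU before returning to the parent

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)  # avoids the CUDA init error
    with mp.Pool(processes=2) as p:
        results = p.map(X_power_func, range(4))
    print(results)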
Any news? Have you solved the problem? How? I think the heart of @bapi's answer is that you have to manually transfer each input tensor (a fraction of it, or the same one, depending on your problem) to each GPU.
I solved it like this:
import time
import torch
from torch.multiprocessing import Pool

torch.multiprocessing.set_start_method('spawn', force=True)

def use_gpu(ind, arr):
    # Dummy per-GPU computation; runs on whichever device `arr` lives on
    return (arr.std() + arr.mean() / (1 + arr.abs())).sum()

def mysenddata(mydata):
    # Manually transfer one slice of the data to each of the 4 GPUs
    return [(ii, mydata[ii].cuda(ii)) for ii in range(4)]

if __name__ == "__main__":
    print('create big tensor')
    aa = 10 * torch.randn(4, 10000, 10000).double()
    print('send data')
    b = mysenddata(aa)

    for ii in range(10):
        a = time.time()
        print('start')
        with Pool(processes=4) as p:
            results = p.starmap(use_gpu, b)
        print('end')
        print("cost time :", time.time() - a)

    for ii, (rr, bb) in enumerate(zip(results, b)):
        print('idx:{}, inshape:{}, indevice:{}, intype:{}, '
              'outshape:{}, outdevice:{}, outtype:{}'.format(
                  ii, bb[1].shape, bb[1].get_device(), bb[1].type(),
                  rr.shape, rr.get_device(), rr.type()))
This code seems fine for general GPU processing, but it will not work if the backward method has to be called. Does anyone have a simple tutorial on multi-GPU processing?
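For the backward/training case, the usual tool is torch.nn.parallel.DistributedDataParallel with one process per GPU, which synchronizes gradients across processes during backward. A minimal sketch (not from this thread; the model, shapes, and port are placeholders):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 1).cuda(rank)
    model = DDP(model, device_ids=[rank])  # gradients sync across GPUs
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(32, 10, device=rank)   # placeholder batch
    loss = model(x).sum()
    loss.backward()                        # backward works here
    opt.step()
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)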