Any news? Have you solved the problem? How? I think the heart of @bapi's answer is that you have to manually transfer each input array (either a fraction of it or the whole array, depending on your problem) to the right GPU.
I solved like this:
import time
import torch
from torch.multiprocessing import Pool

torch.multiprocessing.set_start_method('spawn', force=True)


def use_gpu(ind, arr):
    # Some arbitrary per-GPU work on the tensor it received
    return (arr.std() + arr.mean() / (1 + arr.abs())).sum()


def mysenddata(mydata):
    # Put one slice on each of the 4 GPUs (adjust the range to your GPU count)
    return [(ii, mydata[ii].cuda(ii)) for ii in range(4)]


if __name__ == "__main__":
    print('create big tensor')
    aa = 10 * torch.randn(4, 10000, 10000).double()
    print('send data')
    b = mysenddata(aa)

    for ii in range(10):
        a = time.time()
        print('start')
        with Pool(processes=4) as p:
            results = p.starmap(use_gpu, b)
        print('end')
        print("cost time :", time.time() - a)

    for ii, (rr, bb) in enumerate(zip(results, b)):
        print('idx:{}, inshape:{}, indevice:{}, intype:{}, outshape:{}, outdevice:{}, outtype:{}'.format(
            ii, bb[1].shape, bb[1].get_device(), bb[1].type(),
            rr.shape, rr.get_device(), rr.type()))
(Note that the original version also created an extra `Pool(processes=4)` inside the loop that was never used or closed, leaking worker processes each iteration; the `with Pool(...)` context manager handles cleanup.) This code seems fine for general GPU processing, but it will not work if the backward method has to be called. Does anyone have a simple tutorial on training done across multiple GPUs?