Parallelize simple for-loop for single GPU

Hello,

I have a for loop that makes independent calls to a certain function. Since the calls are completely independent, they should be processed in parallel. The following code works on the CPU; for the GPU I am still trying to get it working.

import multiprocessing

import torch
import torch.nn as nn
from joblib import Parallel, delayed
from torch.autograd import Function

class some_function(Function):
    @staticmethod
    def forward(ctx, x):
        pass # here goes the code of the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        pass

class L_Field(nn.Module):
    def __init__(self):
        super(L_Field, self).__init__()

    def forward(self, x):
        batch_size, channels, height, width = x.shape
        num_cores = multiprocessing.cpu_count()
        results = Parallel(n_jobs=num_cores, backend="threading")(
            delayed(some_function.apply)(x[b, c])
            for b in range(batch_size) for c in range(channels)
        )
        out = torch.stack(results).reshape(batch_size, channels, height, width)
        return out
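(For illustration, assuming the forward and backward placeholders above are filled in so that some_function.apply returns a tensor shaped like its (height, width) input slice, the module is used like any other; the shapes below are hypothetical:)

model = L_Field()
x = torch.randn(2, 3, 8, 8, requires_grad=True)  # hypothetical shape
out = model(x)        # per-(batch, channel) slices dispatched to the thread pool
out.sum().backward()  # threads share one process, so the autograd graph stays intact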

For the GPU, I tried the following:

import torch
import torch.nn as nn
import torch.multiprocessing as mp
from torch.autograd import Function

class some_function(Function):
    @staticmethod
    def forward(ctx, x):
        pass # here goes the code of the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        pass

class L_Field(nn.Module):
    def __init__(self):
        super(L_Field, self).__init__()

    def forward(self, x):
        batch_size, channels, height, width = x.shape
        num_cores = mp.cpu_count()
        x.share_memory_()
        pool = mp.Pool(processes=num_cores)
        results = [
            pool.apply(some_function.apply, args=(x[b, c],))
            for b in range(batch_size) for c in range(channels)
        ]
        pool.close()
        pool.join()
        out = torch.stack(results).reshape(batch_size, channels, height, width)
        return out

For which I get:

RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries.  If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
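(For reference, this error can be reproduced without CUDA at all, since the slices x[b, c] are views of x and therefore non-leaf tensors that require grad. A minimal sketch, with a hypothetical identity worker:)

import torch
import torch.multiprocessing as mp

def identity(t):
    return t

if __name__ == "__main__":
    x = torch.randn(2, 2, requires_grad=True)
    y = x * 2  # a non-leaf tensor that requires grad, like a slice of x would be
    with mp.Pool(processes=1) as pool:
        pool.apply(identity, args=(y,))  # raises the "Cowardly refusing ..." RuntimeError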

If I do the detach:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Then, if I do mp.set_start_method('spawn', force=True):

RuntimeError: CUDA error: out of memory

Any ideas will be appreciated.

9 Likes

Could you solve it?
I have the same problem; I want to multiprocess a for loop. Here is my forward code:

def forward(self, x):
    output, (self.h, self.c) = self.lstm(x, (self.h, self.c))
    i = 0
    out = []
    for batch_size in output.batch_sizes:
        out.append(self.linear(output.data[i:batch_size + i, :]))
        i = i + batch_size
    outs = torch.cat(out)
    return outs
(output is a PackedSequence, and self.lstm and self.linear are the respective nn.Modules.)
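(For context, a self-contained toy sketch of what this loop computes; since nn.Linear acts row-wise on output.data, concatenating the per-time-step slices should match a single call on the full data tensor. The sizes are hypothetical and a freshly packed sequence stands in for the LSTM output:)

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_sequence

hidden_size, out_features = 8, 4  # hypothetical sizes
linear = nn.Linear(hidden_size, out_features)

# A freshly packed sequence stands in for the LSTM output here.
packed = pack_sequence([torch.randn(5, hidden_size),
                        torch.randn(3, hidden_size)])

# The loop above, written out: one linear call per time step ...
i, out = 0, []
for batch_size in packed.batch_sizes:
    out.append(linear(packed.data[i:batch_size + i, :]))
    i = i + batch_size
looped = torch.cat(out)

# ... which matches applying the linear once to the whole .data tensor.
vectorized = linear(packed.data)
print(torch.allclose(looped, vectorized))  # True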
I have tried different ways, but I always receive:
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries.  If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).

Is it possible to do multiprocessing over inputs that are non-leaf and require grad?
Thank you!

Please, @ptrblck_de (or some expert): is it possible to do multiprocessing over inputs that require grad and are not leaves? How is it done?
I would appreciate any help.

1 Like

I have the same issue.

1 Like

also waiting for an update on this issue

1 Like

I also have the same issue :tired_face:

1 Like

Hi, did you find a solution here? I found similar questions at https://discuss.pytorch.org/t/how-to-parallelize-a-loop-over-the-samples-of-a-batch/32698 and https://discuss.pytorch.org/t/running-multiple-modules-in-parallel/58164/2. Both answers give a way to reformulate the problem, but they do not solve the problem of multiprocessing on one GPU inside the forward function.

2 Likes