Multiprocessing with tensors (requires grad)


I am trying to use the torch.multiprocessing tool. However, it does not seem to work with tensors that require grad. A CPU version of the code is shown below. Do you have any suggestions? (P.S. I have only a single GPU and one 8-core CPU and want to do parallel computation, but most of the tutorials I see use multiprocessing for multi-GPU setups.)


import torch.multiprocessing as mp
import torch
def job(num):
    return num * 2

if __name__ == '__main__':
    p = mp.Pool(processes=20)
    a = [torch.tensor(float(i), requires_grad=True) for i in range(20)]
    # a = [torch.tensor(float(i), requires_grad=False) for i in range(20)] # works

    data = p.map(job, a)
    datatorch = torch.stack(data)
    l = datatorch - datatorch / 2.0


MaybeEncodingError: Error sending result: ‘[tensor(0., grad_fn=)]’. Reason: ‘RuntimeError(‘Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).’,)’


Indeed, regular autograd cannot track gradients across process boundaries.
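To illustrate what the error message is suggesting, here is a minimal single-process sketch (the variable names are made up for illustration). It shows that calling detach() gives you a tensor that can be treated as plain data, but that tensor no longer carries a grad_fn, so nothing computed from it can backpropagate to the original input.

```python
import torch

# A leaf tensor created by the user.
x = torch.tensor(3.0, requires_grad=True)

# A non-leaf tensor: it carries a grad_fn linking it back to x.
# This is the kind of tensor the worker in the question tries to return.
y = x * 2

# detach() returns a tensor with the same data but cut off from the graph.
# This is what the error message recommends before serializing.
sendable = y.detach()

print(y.requires_grad, y.grad_fn is not None)
print(sendable.requires_grad, sendable.grad_fn is None)
```

So detaching makes the transfer possible, but it is only useful if you just need the values and not the gradients.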

If you really want this, you can take a look at the experimental distributed autograd we built on top of RPC here. Note that this is experimental, so there might be some rough edges.


Thank you so much for the quick reply! I am not sure if “multiprocessing” is the thing I need. The actual scenario is this: I have a batch input of size “BxNxC” and want to feed it into a layer named L (L consists of some differentiable operations without trainable parameters). However, L can only accept input of size “NxC”, so I need to write a for loop to feed the whole batch through it.

I find that even if I manually replicate L B times, the copies still run sequentially. That is why I turned to parallel computation :smiley: I am not sure if there is any way to speed up the “for loop” without touching the “L” function/layer?

I will take a look at the link you share, thanks!


If your ops are small (and I assume they are since you don’t batch), the overhead of multiprocessing is most likely going to be much higher than any benefit you’re gonna get from it :confused:

The best thing to do here would be to change L to accept a batch of inputs (don’t hesitate to ask questions about this here).
Otherwise, the for-loop will be the best thing you can do. Maybe multithreading can be a bit faster but will make your code much more complicated.
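As a rough sketch of the two options, assuming a made-up stand-in for L (layer_single and layer_batched below are hypothetical, not the actual layer from this thread): the for-loop in a single process keeps the autograd graph intact, and a batched rewrite usually only needs the 2D-specific ops (such as .T) replaced by ones that act on the last two dimensions.

```python
import torch

def layer_single(x):
    """Hypothetical stand-in for L: accepts a single (N, C) sample only."""
    assert x.dim() == 2
    weights = torch.softmax(x @ x.T, dim=-1)  # (N, N)
    return weights @ x                        # (N, C)

def layer_batched(x):
    """Same math, written so it accepts a (B, N, C) batch."""
    # transpose(-1, -2) swaps the last two dims, leaving the batch dim alone.
    weights = torch.softmax(x @ x.transpose(-1, -2), dim=-1)  # (B, N, N)
    return weights @ x                                        # (B, N, C)

torch.manual_seed(0)
batch = torch.randn(4, 5, 3, requires_grad=True)  # B=4, N=5, C=3

# Option 1: plain for-loop in one process; gradients still flow to `batch`.
looped = torch.stack([layer_single(sample) for sample in batch])

# Option 2: batched version; one call, no Python loop.
batched = layer_batched(batch)

# Both paths produce gradients with respect to the same input.
(grad_looped,) = torch.autograd.grad(looped.sum(), batch)
(grad_batched,) = torch.autograd.grad(batched.sum(), batch)
```

If L is built only from standard PyTorch ops, torch.vmap (PyTorch 2.0 and later) may be able to batch it without a rewrite, though that depends on the ops involved and is not tested here.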

Hi alban, thanks for the suggestions! The ops in L are complex and take much more time than the other layers. I am just using implementations from other people[1][2], but none of them accepts batched input :thinking: I may try to reimplement it in the future.

Much appreciated!!

@ZEHANG_WENG Did you end up finding a solution to this?

I have a similar problem: the library “torchsort” only accepts 2D input, but I have 3D [B, H, W] batches.

I would somehow like to parallelize multiple batch operations.
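One thing that might help in this situation, assuming the 2D-only function treats every row independently along the last dimension (worth checking in the library's docs): fold the leading dimensions into one, call the function once, and reshape back. The sketch below uses a hypothetical stand-in, rowwise_op_2d, rather than the real torchsort call.

```python
import torch

def rowwise_op_2d(x):
    """Hypothetical stand-in for a library call that only accepts (rows, n)
    and processes each row independently."""
    if x.dim() != 2:
        raise ValueError("expected a 2D input")
    return torch.softmax(x, dim=-1)

torch.manual_seed(0)
x = torch.randn(2, 3, 4, requires_grad=True)  # [B, H, W]
B, H, W = x.shape

# Fold B and H into a single "rows" dimension, apply the 2D op once,
# then restore the original layout. reshape is differentiable.
out = rowwise_op_2d(x.reshape(B * H, W)).reshape(B, H, W)

# Gradients flow back to the original 3D tensor.
(grad_x,) = torch.autograd.grad(out.pow(2).sum(), x)
```

This avoids both the Python loop and multiprocessing, but it only applies when rows do not interact inside the function.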