I am trying to use the torch.multiprocessing tool. However, it does not seem to work with tensors that require grad. A CPU-only version of the code is shown below. Do you have any suggestions? (P.S. I have only a single GPU and one 8-core CPU and want to do parallel computation, but most of the tutorials I see use multiprocessing for multi-GPU setups.)
Code:
import torch.multiprocessing as mp
import torch
def job(num):
    return num * 2

if __name__ == '__main__':
    p = mp.Pool(processes=20)
    a = [torch.tensor(float(i), requires_grad=True) for i in range(20)]
    # a = [torch.tensor(float(i), requires_grad=False) for i in range(20)]  # works
    data = p.map(job, a)
    p.close()
    print(data)
    datatorch = torch.stack(data)
    print(datatorch)
    l = datatorch - datatorch / 2.0
    print(l)
Error:
MaybeEncodingError: Error sending result: '[tensor(0., grad_fn=<MulBackward0>)]'. Reason: 'RuntimeError("Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).",)'
Indeed, we cannot track gradients across processes with the regular autograd.
If you really want this, you can take a look at the experimental distributed autograd we built on top of rpc here. Note that this is experimental, so there might be some rough edges.
Thank you so much for the quick reply! I am not sure if 'multiprocessing' is the thing I need. The actual scenario is: I have a batch input of size 'BxNxC' and want to feed it into a layer named L (L consists of some differentiable operations without trainable parameters). However, L can only accept input of size 'NxC', so I need to write a for loop to feed the whole batch.
I find that even if I manually replicate L B times, it still goes through sequentially. That is what led me to parallel computation topics. Is there any method to speed up the 'for loop' without touching the 'L' function/layer?
If your ops are small (and I assume they are, since you don't batch), the overhead of multiprocessing is most likely going to be much higher than any benefit you're gonna get from it.
The best thing to do here would be to change L to accept a batch of inputs (don't hesitate to ask questions about this here).
Otherwise, the for-loop will be the best thing you can do. Maybe multithreading can be a bit faster but will make your code much more complicated.
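For illustration, here is a minimal sketch with a hypothetical per-sample layer L (any differentiable ops on an NxC input; the real L from the thread is not shown). Autograd tracks gradients through the for loop and torch.stack just fine; the loop is only slow because it is sequential. And if L happens to be built from ops that already broadcast over leading dimensions, batching can be as simple as calling it on the full BxNxC tensor:

```python
import torch

def L(x):
    # hypothetical layer: differentiable ops, no trainable parameters (NxC -> N)
    return (x * x).sum(dim=-1)

B, N, C = 4, 3, 2
batch = torch.randn(B, N, C, requires_grad=True)

# sequential for-loop over the batch; autograd tracks through stack()
out_loop = torch.stack([L(batch[b]) for b in range(B)])  # BxN

# if L's ops broadcast over leading dims, one call handles the whole batch
out_batched = L(batch)  # BxN

assert torch.allclose(out_loop, out_batched)
out_loop.sum().backward()
print(batch.grad.shape)  # torch.Size([4, 3, 2])
```

Whether the one-call version works depends entirely on what is inside the real L; ops that hardcode a 2-D input would need reimplementation, as discussed below.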
Hi alban, thanks for the suggestions! The ops in L are complex and cost much more time compared with the other layers. I just use implementations from other people [1][2], but none of them accepts batches. I may try to reimplement it in the future.