Multiprocessing for-loop on CPU

I am trying to speed up a multidirectional RNN with torch.multiprocessing, because I can't get it to run efficiently on the GPU but I have access to many CPUs on a cluster. The one place where I want to apply multiprocessing is the for-loop over the different directions: they are computed completely independently, and the results are combined in a final layer only after all iterations have finished.

So instead of

results = []
for d in directions:
    results.append(self.forward_direction(batch, param1, param2))

I tried

pool = mp.Pool(processes=self.num_directions)
# starmap expects an iterable of argument tuples, one tuple per call
results = pool.starmap(self.forward_direction, [(batch, param1, param2)] * self.num_directions)

This results in:
MaybeEncodingError: Error sending result: '[tensor([...], grad_fn=<StackBackward>)]'. Reason: 'RuntimeError('Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).',)'

I have not found a solution yet. Is it even possible to multiprocess the for-loop without breaking autograd?

Hi,

You can look at the DistributedDataParallel module to perform distributed training.
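As a rough illustration of what that looks like on CPU, here is a minimal sketch. It assumes the "gloo" backend and a file-based rendezvous, and it uses nn.Linear as a hypothetical stand-in for the RNN. The helper run_worker and the demo with world_size=1 are made up for this example. Note that DistributedDataParallel splits the data across processes and synchronizes gradients; it does not parallelize the loop over directions itself.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_worker(rank, world_size, init_file):
    # Join the process group; "gloo" is the backend that works on CPU.
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    try:
        model = DDP(nn.Linear(8, 1))  # hypothetical stand-in for the real RNN
        batch = torch.randn(4, 8)     # each rank would load its own data shard
        loss = model(batch).sum()
        loss.backward()               # DDP averages gradients across ranks here
        return model.module.weight.grad.clone()
    finally:
        dist.destroy_process_group()


# Single-process demo (world_size=1). In real use, one process per rank
# would call run_worker with the same init_file and its own rank.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
grad = run_worker(rank=0, world_size=1, init_file=init_file)
```

With more than one rank, each process would typically be started with torch.multiprocessing.spawn or a cluster launcher, and every rank ends up with the same averaged gradients after backward().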
Otherwise, we very recently added support for autograd with the distributed package to master (that is the distributed package, not the multiprocessing package).
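A different route, not one of the two suggestions above, is to keep everything in one process and run the directions in threads, so no tensor ever has to cross a process boundary and the autograd graph stays intact. The sketch below assumes one small nn.RNN per direction and a linear combining layer; all names and shapes are hypothetical. Whether this actually gives a speedup depends on how much time is spent inside PyTorch operators and on the intra-op thread settings, so it is worth measuring.

```python
from concurrent.futures import ThreadPoolExecutor

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: one small RNN per direction plus a final layer.
num_directions = 4
cells = nn.ModuleList([nn.RNN(8, 16, batch_first=True) for _ in range(num_directions)])
combine = nn.Linear(16 * num_directions, 1)
batch = torch.randn(2, 5, 8)  # (batch, time, features)


def forward_direction(cell, x):
    out, _ = cell(x)
    return out[:, -1]  # output at the last time step, shape (batch, 16)


# Threads share the process, so the results keep their grad_fn.
with ThreadPoolExecutor(max_workers=num_directions) as pool:
    results = list(pool.map(lambda cell: forward_direction(cell, batch), cells))

# Combine the directions and backpropagate from the main thread.
loss = combine(torch.cat(results, dim=1)).sum()
loss.backward()
```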