Parallelizing for loop in inference compatible with autograd

The network I am using has a batch size of 1, but runs inference via a for loop over 32 samples before calling .backward().
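For concreteness, here is a minimal sketch of what I am doing (the `Linear` model and the squared loss are just placeholders for my actual network):

```python
import torch
import torch.nn as nn

# Placeholder stand-in for my small network.
model = nn.Linear(4, 1)
samples = [torch.randn(1, 4) for _ in range(32)]  # 32 samples, batch size 1 each

total_loss = torch.zeros(())
for x in samples:  # the sequential loop I would like to parallelize
    out = model(x)
    total_loss = total_loss + out.pow(2).mean()

total_loss.backward()  # single backward over all 32 samples
```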

I would like to parallelize over this for loop. I realize one way of doing so is via DataParallel or DistributedDataParallel, but I only have a few GPUs and my model is very small.

I am wondering if it is possible to use torch.multiprocessing or torch.distributed to parallelize over several CPU threads on a single GPU. When trying torch.multiprocessing, I get the error:

Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries.  If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).

Any suggestions would be much appreciated!


You will have to use the distributed autograd engine to be able to run autograd across multiple processes, but it is experimental right now.

But for your problem, if your tasks are very small, you might want to make sure that the multiprocessing overhead won’t be greater than the gain.

Thank you for your reply, @albanD!

I noticed that in PyTorch 1.6 there is now thread parallelism for autograd on CPU. Does this mean I can now achieve what I wanted?

This new change doesn’t affect multiple processes.
But if you have a multithreaded workload, it might help, yes.