The network I am using has a batch size of 1, but it runs the forward pass in a for loop over 32 samples before calling .backward().
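For concreteness, here is a minimal sketch of the pattern I mean (the model, loss, and data below are hypothetical stand-ins, not my actual network):

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the real one.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# 32 samples, each with batch size 1.
samples = [(torch.randn(1, 10), torch.randn(1, 1)) for _ in range(32)]

optimizer.zero_grad()
total_loss = torch.tensor(0.0)
for x, y in samples:
    # Forward pass one sample at a time; accumulate the loss.
    total_loss = total_loss + criterion(model(x), y)

# Single backward over the accumulated loss, then one optimizer step.
total_loss.backward()
optimizer.step()
```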
I would like to parallelize this for loop. I realize one way of doing so is via DataParallel or DistributedDataParallel, but I only have a few GPUs and my model is very small.
I am wondering if it is possible to use torch.multiprocessing or torch.distributed to parallelize the loop across several CPU workers feeding one single GPU? When I try torch.multiprocessing, I get the error:
Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
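For context, here is a minimal sketch of the kind of code that triggers this error for me (hypothetical reduced example, not my training code). The error is raised as soon as a non-leaf tensor that requires grad is pickled across a process boundary, e.g. when putting it on a queue:

```python
import torch
import torch.multiprocessing as mp

x = torch.ones(3, requires_grad=True)
y = x * 2  # y is a NON-leaf tensor: it has grad history attached

# SimpleQueue serializes synchronously on put(), so the error
# surfaces right here rather than in a background feeder thread.
q = mp.SimpleQueue()

caught = False
try:
    q.put(y)  # autograd graphs cannot cross process boundaries
except RuntimeError as e:
    caught = True
    print(e)

# Detaching first works, but the autograd history is lost,
# so gradients can no longer flow back through this tensor.
q.put(y.detach())
```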
Any suggestions would be much appreciated!