The network I am using has a batch size of 1, but runs inference via a for loop over 32 samples before calling
I would like to parallelize this for loop. I realize one way of doing so is
DistributedDataParallel, but I have only a few GPUs and my model is very small.
I am wondering: is it possible to use
torch.distributed to parallelize over several CPU threads on a single GPU? When trying
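For reference, this is roughly what my inference loop looks like (the model and shapes here are toy placeholders, not my actual network):

```python
import torch
import torch.nn as nn

# Toy stand-in for my setup: a small model, batch size 1,
# looping over 32 samples one at a time.
model = nn.Linear(8, 2)
samples = [torch.randn(1, 8) for _ in range(32)]

outputs = []
for x in samples:  # this sequential loop is what I want to parallelize
    outputs.append(model(x))
out = torch.cat(outputs, dim=0)
```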
torch.multiprocessing I get the error:

```
Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
```
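For what it's worth, here is my understanding of what the error is pointing at (a minimal sketch; detaching before serializing is just what the message suggests, and I haven't confirmed it fits my use case since I still need gradients):

```python
import torch

# A leaf tensor that requires grad, and a non-leaf result of an op on it.
x = torch.randn(4, requires_grad=True)
y = x * 2  # non-leaf: has a grad_fn, so it cannot cross process boundaries

# detach() returns a view cut off from the autograd graph, which
# torch.multiprocessing can serialize (e.g. put on a queue) -- but the
# receiving process then has no connection back to x's gradients.
y_safe = y.detach()
```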
Any suggestions would be much appreciated!