Hi! I’ve been looking into parallelizing various PyTorch operations. At the model level (e.g. training on several GPUs) this appears to be fairly straightforward, and there are plenty of good tutorials out there.
However, I have been trying to parallelize an operation where I split a batch tensor and operate on each of the individual samples, like so (this is just a minimal working example; there is an actual reason for me to split the batch):
    import torch
    import torch.nn as nn
    import torch.multiprocessing as mp


    class Model(nn.Module):
        def __init__(self):
            nn.Module.__init__(self)
            self.lin1 = nn.Linear(100, 100)
            self.lin2 = nn.Linear(100, 30)
            self.lin3 = nn.Linear(30, 3)

        def forward_single(self, x):
            # just a dummy method
            return self.lin2(x)

        def forward(self, xs):
            step1 = self.lin1(xs)
            step2 = []
            # ---- I would really like to parallelize the following loop
            for x in torch.split(step1, 1):
                step2.append(self.forward_single(x))
            # ----
            step2 = torch.cat(step2, dim=0)
            ys = self.lin3(step2)
            return ys


    if __name__ == '__main__':
        # set the start method inside the guard so spawned workers do not re-run it
        mp.set_start_method("spawn")

        input = torch.ones(64, 100).cuda()
        target = torch.ones(64, 3).cuda()

        model = Model()
        model.cuda()

        output = model(input)
        loss_func = nn.MSELoss()
        loss = loss_func(output, target)
        loss.backward()
        print(output)
However, any approach I have tried results in the same error message:
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
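The message seems to come from the serialization step rather than from the loop itself. Below is a minimal sketch that reproduces it on the CPU without starting any worker process. It assumes that `torch.multiprocessing` registers its tensor reducers with the standard library's `ForkingPickler` (which is what multiprocessing queues and pipes use to pickle objects); the helper name `try_serialize` is made up for illustration.

```python
import torch
import torch.nn as nn
from multiprocessing.reduction import ForkingPickler


def try_serialize(tensor):
    """Hypothetical helper: pickle a tensor the way multiprocessing queues do.

    Returns None if pickling succeeds, otherwise the RuntimeError message.
    """
    try:
        ForkingPickler.dumps(tensor)
        return None
    except RuntimeError as err:
        return str(err)


lin = nn.Linear(100, 100)
x = torch.ones(1, 100)

# Output of a layer with trainable weights: requires_grad is True and the
# tensor has a grad_fn, so it is a non-leaf tensor in the autograd graph.
step1 = lin(x)

message = try_serialize(step1)
print(message)
```

If this assumption holds, any tensor like `step1` that is handed to another process hits the same check. Calling `detach()` first, as the message suggests, would let the data through but cut the graph, so gradients would no longer flow back into `lin1`.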
So: (1) Can I actually do this? The error message sounds as if I may not be able to parallelize that loop at all. (2) If I can, how? I’d take any dirty workaround, as my code is really pretty slow right now …