Hi! I’ve been looking into how to parallelize different PyTorch operations. At the model level, e.g. to train on several GPUs, this appears to be fairly straightforward, and there are plenty of good tutorials out there.
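(Just for context, and not what I’m asking about here: by model-level parallelism I mean something along the lines of wrapping the model in nn.DataParallel, roughly like this:

```python
import torch
import torch.nn as nn

# Model-level data parallelism: nn.DataParallel replicates the module on the
# visible GPUs and scatters the input batch across them.
net = nn.Linear(100, 3)
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)
net = net.cuda()

out = net(torch.ones(64, 100).cuda())
```

That part works fine for me.)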
However, I have been trying to parallelize an operation where I split a batch tensor and operate on each of the individual samples, like so (this is just an MWE; there is an actual reason for me to split the batch):
```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

mp.set_start_method("spawn")


class Model(nn.Module):

    def __init__(self):
        nn.Module.__init__(self)
        self.lin1 = nn.Linear(100, 100)
        self.lin2 = nn.Linear(100, 30)
        self.lin3 = nn.Linear(30, 3)

    def forward_single(self, x):
        # just a dummy method
        return self.lin2(x)

    def forward(self, xs):
        step1 = self.lin1(xs)
        step2 = []
        # ---- I would really like to parallelize the following loop
        for x in torch.split(step1, 1):
            step2.append(self.forward_single(x))
        # ----
        step2 = torch.cat(step2, dim=0)
        ys = self.lin3(step2)
        return ys


if __name__ == '__main__':
    input = torch.ones(64, 100).cuda()
    target = torch.ones(64, 3).cuda()

    model = Model()
    model.cuda()

    output = model(input)

    loss_func = nn.MSELoss()
    loss = loss_func(output, target)
    loss.backward()

    print(output[0])
```
However, every approach I have tried results in the same error message:

RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
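To be concrete, one of the variants I tried looked roughly like the following (reconstructed from memory, so the pool-based approach and the number of workers are only illustrative): I replaced the loop with a pool.map over the split chunks, and the error is raised as soon as the chunks of step1 (non-leaf tensors with requires_grad=True) get pickled for the worker processes:

```python
# Rough reconstruction of one attempt: map forward_single over the chunks
# with a torch.multiprocessing pool. Pickling the chunks (non-leaf tensors
# that require grad) is what triggers the RuntimeError above.
def forward(self, xs):
    step1 = self.lin1(xs)
    with mp.Pool(processes=4) as pool:
        step2 = pool.map(self.forward_single, torch.split(step1, 1))
    step2 = torch.cat(step2, dim=0)
    return self.lin3(step2)
```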
So: (1) Can I actually do this? The error message sounds as if I may not be able to parallelize that loop at all. (2) If I can, how? I’d take any dirty workaround, as my code is really pretty slow right now …
Thank you!