I have a question about how to parallelize my forward and backward computation.
Here is the code:
```python
batch_size = 32

for index, i in enumerate(indices):
    sentA = lsents[i]
    sentB = rsents[i]
    label = Variable(torch.LongTensor(labels[i])).cuda()

    output, extra_loss = model(sentA, sentB, index)
    loss = criterion(output, label) + extra_loss
    loss.backward()  # accumulates gradients into each parameter's .grad

    if (index + 1) % batch_size == 0:
        optimizer.step()
        optimizer.zero_grad()
```
Because my model involves very complex interactions between the two sentences, it is not easy to batch the operations inside the model, so I process one sample at a time. However, I only update the parameters once every 32 samples.
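(One detail worth noting: as written, the 32 per-sample gradients are *summed* rather than averaged, so the effective step is 32x larger than for a true averaged batch. A short sketch, assuming the intent is to match the mean batch loss, is to scale each per-sample loss before calling backward:)

```python
# Assumption: to make the accumulated gradient equal the gradient of the
# *average* loss over 32 samples, divide each per-sample loss by batch_size.
loss = (criterion(output, label) + extra_loss) / batch_size
loss.backward()
```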
I have two questions:
- Is this operation valid, i.e. can I update the parameters once every 32 samples? My understanding is that each `loss.backward()` call accumulates the gradient into the `.grad` of every parameter, so `optimizer.step()` then applies a single update based on all 32 samples. (A small self-contained check of this appears after the list.)
- As you can see, I update the parameters every 32 samples, which means these 32 `.backward()` calls are independent and could in principle run in parallel, even on a single GPU (the computation for one sample does not use much memory). Is there any suggestion on how to implement this? (A sketch of one workaround follows below.)
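For the first question, here is a minimal, self-contained check (using the current tensor API rather than `Variable`; the toy values are made up for illustration) showing that successive `backward()` calls do accumulate into `.grad`:

```python
import torch

w = torch.ones(3, requires_grad=True)

# Two separate backward passes on the same parameter...
(w * 2).sum().backward()
(w * 3).sum().backward()

# ...accumulate their gradients: 2 + 3 = 5 per element.
print(w.grad)  # tensor([5., 5., 5.])
```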
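For the second question, one commonly suggested variant (a sketch under my assumptions, reusing the names from the snippet above) is to sum the 32 per-sample losses and call `backward()` once. This is not true parallelism, but it removes 31 of the 32 Python-level backward calls and lets the autograd engine traverse all 32 independent graphs in a single pass, at the cost of keeping all 32 graphs in memory:

```python
batch_size = 32
total_loss = 0

for index, i in enumerate(indices):
    label = Variable(torch.LongTensor(labels[i])).cuda()
    output, extra_loss = model(lsents[i], rsents[i], index)
    total_loss = total_loss + criterion(output, label) + extra_loss

    if (index + 1) % batch_size == 0:
        total_loss.backward()  # one backward over all 32 sample graphs
        optimizer.step()
        optimizer.zero_grad()
        total_loss = 0
```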