I have a question about how to parallelize my forward and backward computation.
Here is the code:
```python
batch_size = 32

for index, i in enumerate(indices):
    sentA = lsents[i]
    sentB = rsents[i]
    label = Variable(torch.LongTensor(labels[i])).cuda()

    output, extra_loss = model(sentA, sentB, index)
    loss = criterion(output, label) + extra_loss
    loss.backward()  # accumulates gradients into each parameter's .grad

    if (index + 1) % batch_size == 0:
        optimizer.step()
        optimizer.zero_grad()
```
Because my model involves very complex interactions between the two sentences, it is not easy to batch the operations inside the model, so I process one sample at a time. However, I only update the parameters once every 32 samples.
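(One detail worth noting: as written, the 32 per-sample gradients are *summed* rather than averaged, so the effective step is 32x larger than for a true averaged batch. A short sketch, assuming the intent is to match the mean batch loss, is to scale each per-sample loss before calling backward:)

```python
# Assumption: to make the accumulated gradient equal the gradient of the
# *average* loss over 32 samples, divide each per-sample loss by batch_size.
loss = (criterion(output, label) + extra_loss) / batch_size
loss.backward()
```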
I have two questions:
- Is this operation valid, i.e. can I update the parameters once every 32 samples? My understanding is that each `loss.backward()` call accumulates the gradient into the `.grad` of every parameter, so `optimizer.step()` then applies a single update based on all 32 samples. (A small self-contained check of this appears after the list.)
- As you can see, I update the parameters every 32 samples, which means these 32 `.backward()` calls are independent and could in principle run in parallel, even on a single GPU (the computation for one sample does not use much memory). Is there any suggestion on how to implement this? (A sketch of one workaround follows below.)
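For the first question, here is a minimal, self-contained check (using the current tensor API rather than `Variable`; the toy values are made up for illustration) showing that successive `backward()` calls do accumulate into `.grad`:

```python
import torch

w = torch.ones(3, requires_grad=True)

# Two separate backward passes on the same parameter...
(w * 2).sum().backward()
(w * 3).sum().backward()

# ...accumulate their gradients: 2 + 3 = 5 per element.
print(w.grad)  # tensor([5., 5., 5.])
```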
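For the second question, one commonly suggested variant (a sketch under my assumptions, reusing the names from the snippet above) is to sum the 32 per-sample losses and call `backward()` once. This is not true parallelism, but it removes 31 of the 32 Python-level backward calls and lets the autograd engine traverse all 32 independent graphs in a single pass, at the cost of keeping all 32 graphs in memory:

```python
batch_size = 32
total_loss = 0

for index, i in enumerate(indices):
    label = Variable(torch.LongTensor(labels[i])).cuda()
    output, extra_loss = model(lsents[i], rsents[i], index)
    total_loss = total_loss + criterion(output, label) + extra_loss

    if (index + 1) % batch_size == 0:
        total_loss.backward()  # one backward over all 32 sample graphs
        optimizer.step()
        optimizer.zero_grad()
        total_loss = 0
```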