How to parallelize the forward and backward computation

I have a question about how to parallelize my forward and backward computation.

Here is the code:

batch_size = 32
for index, i in enumerate(indices):
	sentA = lsents[i]
	sentB = rsents[i]
	label = Variable(torch.LongTensor(labels[i])).cuda()
	output, extra_loss = model(sentA, sentB, index)
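	loss = criterion(output, label)+extra_loss
	loss.backward()  # gradients accumulate in each parameter's .grad
	if (index+1) % batch_size == 0:
		optimizer.step()       # update once every batch_size samples
		optimizer.zero_grad()  # reset the accumulated gradients
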
Because my model has some very complex interactions between the two sentences, it is not easy to batch the operations inside the model, so I process only one sample at a time. However, I update the parameters every 32 samples.

I have two questions:

  1. Is this operation valid (update the parameters every 32 samples)?
    My understanding is that each loss.backward() accumulates the gradient in the .grad of every parameter, so calling optimizer.step() every 32 samples updates the parameters with the gradient accumulated over those 32 samples.
  2. As you can see, I update the parameters every 32 samples, which means these 32 .forward() + .backward() operations could run in parallel, even on a single GPU (the computation for one sample doesn’t take much memory). Are there any suggestions for how to implement this?


  1. As long as there isn’t any batch-specific computation (like batch normalization), this is a valid way to simulate a batch of 32.

  2. You can implement a DataLoader with batch_size=32.
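
To make the first point concrete, here is a minimal sketch of gradient accumulation. It assumes a toy `nn.Linear` classifier and random data as stand-ins for the real sentence-pair model (neither is from the original post). It checks that per-sample backward passes, with each loss divided by the batch size, accumulate to the same gradient as one mean-reduced batch of 32.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size = 32

# Toy stand-ins (assumptions): a linear classifier and random data.
model = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()  # mean reduction by default
inputs = torch.randn(batch_size, 10)
labels = torch.randint(0, 3, (batch_size,))

# (a) One sample at a time; backward() adds into each parameter's .grad.
model.zero_grad()
for i in range(batch_size):
    output = model(inputs[i:i + 1])            # shape (1, 3)
    loss = criterion(output, labels[i:i + 1])
    (loss / batch_size).backward()             # scale so the sum equals the batch mean
accumulated = [p.grad.clone() for p in model.parameters()]

# (b) The whole batch in a single forward/backward pass.
model.zero_grad()
criterion(model(inputs), labels).backward()
batched = [p.grad.clone() for p in model.parameters()]

# True when both approaches produce the same gradients (up to float error).
grads_match = all(
    torch.allclose(a, b, atol=1e-6) for a, b in zip(accumulated, batched)
)
```

Without the division by `batch_size`, the accumulated gradient is the sum over the 32 samples, i.e. 32 times the mean-reduced batch gradient, which for plain SGD behaves like a 32 times larger learning rate.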

Thanks for your kind reply!

For the second question, my model’s forward() function can only process one instance as input.

As I understand it, if I use a DataLoader with batch_size=32, then my model has to be able to process batched input. Is this correct?

Or, alternatively, although my model can only process one instance at a time, if I use a batched DataLoader and give the model batched input, will the model accept it and automatically split the batch into 32 individual inputs in some way?
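
One possible middle ground, sketched below under stated assumptions: a DataLoader only groups samples and does not change what forward() accepts, so a model written for single instances will not split a stacked batch on its own. Passing a custom `collate_fn` that returns the samples as a plain list lets the training loop keep calling the model once per sample. `PairDataset`, `keep_as_list`, and the toy token-id lists are hypothetical names and data used only for illustration.

```python
from torch.utils.data import DataLoader, Dataset


class PairDataset(Dataset):
    """Hypothetical dataset yielding (sentA, sentB, label) one sample at a time."""

    def __init__(self, lsents, rsents, labels):
        self.lsents, self.rsents, self.labels = lsents, rsents, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return self.lsents[i], self.rsents[i], self.labels[i]


def keep_as_list(samples):
    # Return the samples unchanged instead of stacking them into tensors,
    # so variable-length sentences need no padding.
    return samples


# Toy data (assumption): sentences as variable-length lists of token ids.
lsents = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10], [11]]
rsents = [[12], [13, 14], [15, 16, 17], [18], [19, 20]]
labels = [0, 1, 0, 1, 1]

loader = DataLoader(PairDataset(lsents, rsents, labels),
                    batch_size=2, collate_fn=keep_as_list)

batch_sizes = []
for batch in loader:
    for sentA, sentB, label in batch:
        # Here one would call model(sentA, sentB, ...) and loss.backward()
        # for this single sample, accumulating gradients.
        pass
    # Here one would call optimizer.step() and optimizer.zero_grad().
    batch_sizes.append(len(batch))
```

This only organizes the update schedule; the samples inside each group are still processed sequentially, so it does not by itself run the 32 forward and backward passes in parallel.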