I have a question about how to parallelize my forward and backward computation

Here is the code:

```
batch_size = 32
for index, i in enumerate(indices):
    sentA = lsents[i]
    sentB = rsents[i]
    label = Variable(torch.LongTensor(labels[i])).cuda()
    output, extra_loss = model(sentA, sentB, index)
    loss = criterion(output, label) + extra_loss
    loss.backward()           # accumulates gradients into each parameter's .grad
    if (index + 1) % batch_size == 0:
        optimizer.step()      # apply the gradients accumulated over 32 samples
        optimizer.zero_grad()
```

Because my model has some very complex interactions between the two sentences, it is not easy to batch the operations inside the model. So I process one sample at a time, but update the parameters every 32 samples.

I have two questions:

- Is this operation valid (updating the parameters every 32 samples)? My understanding is that each `loss.backward()` accumulates gradients into the `.grad` of every parameter, so `optimizer.step()` updates the parameters once every 32 samples.
- Since I only update the parameters every 32 samples, these 32 `.forward()` + `.backward()` operations could in principle run in **parallel**, even on one GPU (the computation for one sample doesn't take much memory). Is there any suggestion on how to implement this?
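To convince myself that the accumulation is equivalent to a batched update, I checked the math on a toy stand-in (a scalar linear model in NumPy, not my actual model): the gradient of a sum of per-sample losses equals the sum of the per-sample gradients, which is exactly what repeated `loss.backward()` calls accumulate.

```python
import numpy as np

# Toy stand-in for the real model: scalar linear regression,
# loss_i = (w * x_i - y_i)^2, so dL_i/dw = 2 * (w * x_i - y_i) * x_i.
rng = np.random.default_rng(0)
x = rng.normal(size=32)
y = rng.normal(size=32)
w = 0.5

# "Accumulated" gradient: one backward per sample, summed into .grad.
grad_accum = 0.0
for xi, yi in zip(x, y):
    grad_accum += 2.0 * (w * xi - yi) * xi

# "Batched" gradient: gradient of the summed loss over all 32 samples.
grad_batch = (2.0 * (w * x - y) * x).sum()

print(np.allclose(grad_accum, grad_batch))  # True
```

(If the criterion averages over the batch instead of summing, the accumulated loss would need to be divided by 32 to match.)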

Thanks!