Batch processing for custom forward

I have a model with only 3 tensors (say p1, p2 and p3) as trainable parameters. The forward function takes in 2 tensors as input, calculates a score and returns it. Given below is the code:

def forward(self, a, b):
    score = custom_func(a, b, self.p1, self.p2, self.p3)
    return torch.sigmoid(score)

I am able to backprop and update the parameters.
However, I am currently calling forward() one-by-one for each example, accumulating the gradients using loss.backward() after each forward() and calling optimizer.step() every few examples.
Is there a way to process them in parallel? Instead of passing a, b as arguments to the forward function, can I instead pass a list?
I don’t know how CUDA works internally, but can I somehow spawn ‘separate CUDA threads’ for each of the examples and process them in parallel?
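
For concreteness, the per-example loop described above might look roughly like the sketch below. The thread does not show the scoring function, the loss, the data, or how often the optimizer steps, so `Scorer`, the binary cross-entropy loss, the random data, and `accum_steps` are all placeholders.

```python
import torch

torch.manual_seed(0)

class Scorer(torch.nn.Module):
    """Placeholder model with three trainable tensors, as in the question."""
    def __init__(self, d):
        super().__init__()
        self.p1 = torch.nn.Parameter(torch.randn(d))
        self.p2 = torch.nn.Parameter(torch.randn(d))
        self.p3 = torch.nn.Parameter(torch.randn(()))

    def forward(self, a, b):
        # Stand-in for custom_func; the real function is not given in the thread.
        score = (a * self.p1).sum() + (b * self.p2).sum() + self.p3
        return torch.sigmoid(score)

d = 4
model = Scorer(d)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # assumed: step the optimizer every 4 examples

# Made-up data: (a, b, target) triples.
data = [(torch.randn(d), torch.randn(d), torch.tensor(1.0)) for _ in range(8)]

num_updates = 0
optimizer.zero_grad()
for i, (a, b, target) in enumerate(data, start=1):
    pred = model(a, b)  # one example at a time
    loss = torch.nn.functional.binary_cross_entropy(pred, target)
    # backward() adds into .grad, so gradients accumulate across examples.
    (loss / accum_steps).backward()
    if i % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        num_updates += 1
```

Each iteration launches its own small set of operations, which is why this pattern tends to be slow on a GPU.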


The short answer is that you can. You “just” need to modify custom_func to accept input Tensors that have an extra dimension, which serves as the batch dimension.
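
As a minimal sketch of what that could look like, the example below uses a made-up scoring function, since the real custom_func is not shown in the thread. The names `custom_func_single` and `custom_func_batched`, the parameter shapes, and the mean loss are assumptions for illustration only. The point is that the batched version reduces over the feature dimension and keeps one score per example.

```python
import torch

torch.manual_seed(0)

# Hypothetical parameters; the real shapes of p1, p2, p3 are not given.
d = 4
p1 = torch.randn(d, requires_grad=True)
p2 = torch.randn(d, requires_grad=True)
p3 = torch.randn((), requires_grad=True)

def custom_func_single(a, b, p1, p2, p3):
    # Stand-in score for one example: a and b have shape (d,).
    return (a * p1).sum() + (b * p2).sum() + p3

def custom_func_batched(a, b, p1, p2, p3):
    # Same computation with a leading batch dimension: a and b have shape (N, d).
    # Summing over the last dimension only leaves one score per example.
    return (a * p1).sum(dim=-1) + (b * p2).sum(dim=-1) + p3

N = 8
a = torch.randn(N, d)
b = torch.randn(N, d)

# Reference: call the single-example version in a Python loop.
looped = torch.stack([custom_func_single(a[i], b[i], p1, p2, p3) for i in range(N)])

# Batched: one call handles all N examples.
batched = custom_func_batched(a, b, p1, p2, p3)

scores = torch.sigmoid(batched)  # shape (N,)
loss = scores.mean()             # placeholder loss
loss.backward()                  # one backward pass for the whole batch
```

Here `p1`, `p2` and `p3` broadcast against the batch dimension, so the parameters need no change; only the reductions inside the function have to name the dimension they act on.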

We are working on adding constructs to do this automatically, but they are not ready yet; I think they will only be available in the fall.

I have a for loop inside custom_func that iterates over each element of the argument a along dimension 0 (the number of iterations varies from example to example). So, if I insert the batch dimension, I’ll have to add another for loop outside this loop that iterates over the batch dimension, right? But that won’t give any performance benefit, will it?