I have a model with only 3 trainable parameters, all tensors (say p1, p2, and p3). The forward function takes two tensors as input, computes a score, and returns it. Here is the code:
def forward(self, a, b):
    score = custom_func(a, b, self.p1, self.p2, self.p3)
    return torch.sigmoid(score)
I am able to backprop and update the parameters.
However, I am currently calling forward() separately for each example, accumulating gradients with loss.backward() after each call, and running optimizer.step() every few examples.
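For reference, my current training loop looks roughly like this (loss_fn, optimizer, examples, and accum_steps are placeholder names, not my actual code):

    import torch

    loss_fn = torch.nn.BCELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    accum_steps = 8  # step every few examples

    optimizer.zero_grad()
    for i, (a, b, label) in enumerate(examples):
        score = model(a, b)              # one forward() call per example
        loss = loss_fn(score, label)
        loss.backward()                  # gradients accumulate across calls
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()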
Is there a way to process the examples in parallel? Instead of passing a and b as single examples to the forward function, can I pass a batch (e.g. a list of examples) instead?
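To make the question concrete, this is the kind of thing I'm imagining. I assume it would only work if custom_func is built from broadcasting-friendly tensor ops so it accepts a leading batch dimension (all names below are hypothetical):

    # Stack the per-example tensors along a new leading batch dimension,
    # then do a single forward/backward pass over the whole batch.
    a_batch = torch.stack(a_list)        # shape: (batch_size, ...)
    b_batch = torch.stack(b_list)        # shape: (batch_size, ...)
    scores = model(a_batch, b_batch)     # (batch_size,) if custom_func broadcasts
    loss = loss_fn(scores, labels)       # labels: (batch_size,)
    loss.backward()                      # one backward pass for all examples
    optimizer.step()
    optimizer.zero_grad()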
I don’t know how CUDA works internally, but can I somehow spawn ‘separate CUDA threads’ for each example and process them in parallel?