Hi,

I have a simple problem (hopefully!) regarding parallelizing part of my model. I have looked around but cannot find a definitive answer on how to approach this, although similar questions seem to have been asked.

TLDR: How do you do a *parallel for loop* across multiple CPUs or GPUs in the same computer in the middle of a gradient step?

What I have is multiple additional computations which I know are embarrassingly parallel, but compute-bound. Currently, I am calculating them sequentially in a for loop within a .fit() function. The results are accumulated and then combined to produce the final loss.

A code sketch is as follows:

```
def fit(self, epochs):
    for i in range(epochs):
        self.optimizer.zero_grad()
        loss = self.fit_get_loss()
        loss.backward(retain_graph=False)
        self.optimizer.step()  # parameter update (omitted in my original sketch)

def fit_get_loss(self):
    # This is the for loop to parallelize
    total_loss = 0
    for j in range(self.N_extra_models):
        m1 = self.extra_models_1[j]
        m2 = self.extra_models_2[j]
        loss1 = m1.fit_get_loss(with_grad=True)
        loss2 = m2.fit_get_loss(with_grad=True)
        total_loss = total_loss + loss1 + loss2
    return total_loss
```

I would like to distribute the computation across many CPUs (e.g. a workstation with 20 cores) so that, for example, 20 of those j iterations run in parallel and I just accumulate the loss values. Doing this on CPU is preferable given the hardware I currently have available. However, I am also keen to know how to apply the same approach to multiple GPUs.
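To make the CPU case concrete, here is a minimal sketch of the pattern I have in mind, using a thread pool from the standard library. It is only an illustration under assumptions: `pair_loss`, `parallel_fit_get_loss` and `StandInModel` are hypothetical names I made up, and the stand-in models return plain floats rather than tensors so the snippet runs on its own. Whether threads actually give a speedup for my real models is exactly what I am unsure about.

```python
from concurrent.futures import ThreadPoolExecutor


def pair_loss(m1, m2):
    # One j-iteration of the loop above: both losses for the same index j.
    return m1.fit_get_loss(with_grad=True) + m2.fit_get_loss(with_grad=True)


def parallel_fit_get_loss(models_1, models_2, max_workers=20):
    # Run the j-iterations on a thread pool, then accumulate the results.
    # All threads live in one process, so the model objects are shared
    # rather than copied into the workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(pair_loss, models_1, models_2))
    return sum(losses)


class StandInModel:
    # Hypothetical placeholder for one of the extra models (assumption):
    # it returns a plain float instead of a tensor loss.
    def __init__(self, value):
        self.value = value

    def fit_get_loss(self, with_grad=True):
        return self.value ** 2


models_1 = [StandInModel(j) for j in range(4)]
models_2 = [StandInModel(j + 0.5) for j in range(4)]
total = parallel_fit_get_loss(models_1, models_2, max_workers=4)
print(total)
```

In the real model, `fit_get_loss` would return a tensor and `.backward()` would be called on the summed loss, as in the loop above. A process pool would be the other obvious option, but as far as I understand the autograd graph does not carry across processes, so gradients would have to be sent back explicitly.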

Actually, the real problem I have is more complicated than the above, and I need to re-use the extra_models, but I think the above is the simplest form of what I'm trying to achieve. The extension of the problem is that I would like to access the updated fitted values for each of those extra_models within the main optimization loop call.

Thanks again