Hello,

I am looking for a way to train multiple models in parallel on a single GPU (RTX A5000). The "models" are actually thousands of small non-linear inversion problems that I want to solve as efficiently as possible. Each problem is independent of the others and has its own input/output and objective function (loss function). From my limited knowledge of this topic, I believe this should be a good candidate for parallelization.

I have already implemented a working version using numpy/scipy, where solving one individual problem takes about 2 seconds, depending on the convergence criteria and so on. If I run them in parallel using Python multiprocessing, the average runtime is reduced to about 0.2 seconds per problem.

The next natural step for me to make the code run faster is to use GPU acceleration (which I have never done before). I know there are several ways to interact with the GPU, but after reading up on it, I thought PyTorch could be a good fit for me because it also has automatic differentiation. I have actually written code that works with PyTorch, but it is not fast (see sample code below).

The problem I have is that I can’t seem to get the individual optimization problems to run on the GPU in parallel. This may be a straightforward thing to do, but I have been looking around in various places, including asking ChatGPT and Bing Chat, without any luck.

I tried to do it using torch multiprocessing, but it does not seem to work the way I thought it would. It seems as if the processes are started in parallel on the CPU and not on the GPU (is this correct?).

Anyway, I would greatly appreciate it if anyone could point me in the right direction on this problem.

Here’s a simplified sample code:

```
from time import time

import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim


class Model(nn.Module):
    def __init__(self, N):
        super().__init__()
        self.model_parameter = nn.Parameter(torch.ones(N))

    def forward(self, input):
        # Placeholder for the actual forward model
        output = do_something_with_input(input)
        return output


# Objective function
def objective_function(modeled, true_data):
    return torch.sum((modeled - true_data) ** 2)


def train_model(model, input, true_data):
    # Move data to GPU
    input = input.to('cuda')
    true_data = true_data.to('cuda')
    model = model.to('cuda')
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    # Optimization
    for iteration in range(20):
        optimizer.zero_grad()
        modeled_data = model(input)
        loss = objective_function(modeled_data, true_data)
        loss.backward()
        optimizer.step()


def train_process(queue):
    # Each worker pulls problems from the shared queue until it is empty
    while not queue.empty():
        model, input, true_data = queue.get()
        train_model(model, input, true_data)


if __name__ == "__main__":
    input_data = load_the_input_data()  # Torch tensor with shape (M,N)
    observed_data = load_the_observed_data()  # Torch tensor with shape (M,N)
    M, N = input_data.shape  # N is number of samples, M is number of problems
    # I have tried setting this equal to number of models, and that does not
    # work well, even when I limit the number of models
    num_processes = 4
    models = [Model(N) for i in range(M)]
    mp.set_start_method('spawn', force=True)
    queue = mp.Queue()
    for i in range(M):
        queue.put((models[i], input_data[i], observed_data[i]))
    processes = []
    for i in range(num_processes):
        p = mp.Process(target=train_process, args=(queue,))
        p.start()
        processes.append(p)
    # Wait for all processes to finish
    for p in processes:
        p.join()
```
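
One alternative to multiprocessing that might be worth trying is to stack all M problems into a single `(M, N)` parameter tensor and optimize them together in one process. Since each row's loss depends only on that row's parameters, summing the per-problem losses should give each problem the same gradient it would get on its own, and Adam updates every element separately. Below is a minimal sketch of that idea. It assumes every problem shares the same form of forward model and loss (which may not hold here, since the objective functions differ between problems), and `forward_model` is a hypothetical stand-in for `do_something_with_input`:

```python
import torch


def forward_model(params, inputs):
    # Hypothetical stand-in for do_something_with_input; any differentiable
    # operation that acts row-wise on (M, N) tensors could go here.
    return torch.tanh(params * inputs)


def solve_batched(inputs, true_data, n_iter=20, lr=0.01):
    # Fall back to the CPU so the sketch also runs without a GPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    inputs = inputs.to(device)
    true_data = true_data.to(device)

    # One row of parameters per problem: shape (M, N)
    params = torch.ones_like(inputs, requires_grad=True)
    optimizer = torch.optim.Adam([params], lr=lr)

    for _ in range(n_iter):
        optimizer.zero_grad()
        modeled = forward_model(params, inputs)
        # Per-problem losses, shape (M,)
        losses = ((modeled - true_data) ** 2).sum(dim=1)
        # Row i of the gradient only depends on losses[i]
        losses.sum().backward()
        optimizer.step()

    # losses holds the values from the last iteration, before the final update
    return params.detach().cpu(), losses.detach().cpu()
```

With this layout, a single call such as `solve_batched(input_data, observed_data)` would process all M problems with one set of GPU kernel launches per iteration, instead of M separate small ones. Per-problem convergence criteria would need extra handling, for example masking out rows that have already converged.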