Training multiple regressors & questions on parallelizing

Hi All,

I would like to ask for your help with an implementation. I am trying to train 200 regressors on the same data (think of these as 200 copies of the same model with the same X, but a different Y to regress to for each).

Eventually I will aggregate these 200 predicted values, but for now I just want to run 200 copies of the model, all with the same input but a different Y_true to regress to. I also do not want them to share weights, i.e. each should update its weights independently.

To create this, I tried instantiating multiple instances of the net as follows:

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(X.shape[1], 100)  # input width = number of features in X
        self.layer2 = nn.Linear(100, 10)
        self.layer3 = nn.Linear(10, 1)

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        x = self.layer3(x)
        return x

models = []          # to store instances of NNs, one per regressor
losses = [0] * 200   # to store regressor-wise losses
y_pred = [0] * 200   # to store the regression predictions
for i in range(200):  # Initialization loop
  m = NeuralNetwork()
  optimizer = optim.Adam(m.parameters(), lr=0.005)
  mse_loss = nn.MSELoss()
  models.append(m)

for i in range(200): # Execution loop
  for epoch in range(10):
    predicted = models[i](x_train_tensor)
    y_pred[i] = predicted
    loss = mse_loss(predicted,y_train_tensor[i])
    losses[i] = loss
    loss.backward()
    optimizer.step()

I have 3 questions about this implementation:
a) Would this work? :)
b) Is the execution loop the right place to call optimizer.step()?
c) How can I parallelize these regressors? I was reading up on the DataParallel module, but that seems to be meant for a single model training on different chunks of X. Could you please share some ideas on this?

a) Not quite. The initialization loop rebinds optimizer on every iteration, so after the loop only the optimizer of the last model is left and only that model would be updated. Besides that, you are currently storing the loss tensor itself in the losses list, which keeps not only the actual loss value but the entire computation graph attached to it and thus increases memory usage. Call item() or detach() on the loss before storing it, and detach() on predicted (item() only works on single-element tensors).
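
To make the memory point concrete, here is a minimal sketch of the difference between storing the raw loss and storing a detached value. The tiny nn.Linear model and the random data are stand-ins invented for illustration, not the setup from the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and data, only for illustration.
model = nn.Linear(4, 1)
x = torch.randn(16, 4)
y = torch.randn(16, 1)

predicted = model(x)
loss = nn.MSELoss()(predicted, y)

# The raw loss tensor references the computation graph through grad_fn,
# so keeping it in a list keeps the whole graph alive.
holds_graph = loss.grad_fn is not None

# item() returns a plain Python float; detach() returns a tensor
# that shares the data but is cut off from the graph.
loss_value = loss.item()
pred_detached = predicted.detach()

print(holds_graph, type(loss_value), pred_detached.requires_grad)
```

Storing loss_value and pred_detached instead of loss and predicted lets the graph be freed once backward() and step() have run.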

b) Not as written. As described in a), you would have to create an optimizer for each model, and also call optimizer.zero_grad() in the training loop, since gradients otherwise accumulate across backward() calls.
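
Putting a) and b) together, a corrected version of the loop could look roughly like the sketch below. The data is randomly generated, the number of models is cut to 5 so it runs quickly, and the layout of y_train_tensor (one (n_samples, 1) target per model along the first dimension) is an assumption, since the original shapes were not posted. The in_features constructor argument is also my addition to avoid depending on a global X.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

# Assumed stand-in data: 5 models here instead of 200, random inputs and targets.
n_models, n_samples, n_features = 5, 64, 8
x_train_tensor = torch.randn(n_samples, n_features)
y_train_tensor = torch.randn(n_models, n_samples, 1)  # y_train_tensor[i]: targets of model i

class NeuralNetwork(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.layer1 = nn.Linear(in_features, 100)
        self.layer2 = nn.Linear(100, 10)
        self.layer3 = nn.Linear(10, 1)

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.layer3(x)

# One model AND one optimizer per regressor, kept in parallel lists.
models = [NeuralNetwork(n_features) for _ in range(n_models)]
optimizers = [optim.Adam(m.parameters(), lr=0.005) for m in models]
mse_loss = nn.MSELoss()

losses = [0.0] * n_models
y_pred = [None] * n_models

for i, (model, optimizer) in enumerate(zip(models, optimizers)):
    for epoch in range(10):
        optimizer.zero_grad()              # clear gradients from the previous step
        predicted = model(x_train_tensor)
        loss = mse_loss(predicted, y_train_tensor[i])
        loss.backward()
        optimizer.step()                   # updates only this model's parameters
        losses[i] = loss.item()            # plain float, no graph attached
        y_pred[i] = predicted.detach()     # tensor without graph
```

Each optimizer only ever sees the parameters of its own model, so the regressors stay fully independent.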

c) If you are using the GPU, the CUDA kernels would be launched asynchronously on the device and could execute in parallel if the CPU is fast enough to enqueue the kernels and if enough GPU resources are available.
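
A different option, not part of the reply above, is to exploit the fact that all regressors share the same X and fold them into one batched computation with torch.bmm. The sketch below is only one possible way to do it and rests on several assumptions: random stand-in data, 5 models instead of 200, a hand-written uniform initialization (not identical to nn.Linear's default), and targets stacked as (n_models, n_samples, 1). Because the total loss is a sum of per-model losses and Adam updates each element separately, every model should still receive only the gradient of its own loss.

```python
import torch

torch.manual_seed(0)

# Assumed stand-in data: shared inputs, one target set per model.
n_models, n_samples, n_features = 5, 64, 8
x = torch.randn(n_samples, n_features)
y = torch.randn(n_models, n_samples, 1)

def init(fan_in, fan_out):
    # One independent weight matrix and bias per model (illustrative init).
    bound = fan_in ** -0.5
    w = torch.empty(n_models, fan_in, fan_out).uniform_(-bound, bound).requires_grad_()
    b = torch.empty(n_models, 1, fan_out).uniform_(-bound, bound).requires_grad_()
    return w, b

w1, b1 = init(n_features, 100)
w2, b2 = init(100, 10)
w3, b3 = init(10, 1)
params = [w1, b1, w2, b2, w3, b3]
optimizer = torch.optim.Adam(params, lr=0.005)

def forward(inputs):
    # Broadcast the shared inputs to every model without copying the data.
    h = inputs.expand(n_models, -1, -1)        # (n_models, n_samples, n_features)
    h = torch.relu(torch.bmm(h, w1) + b1)      # (n_models, n_samples, 100)
    h = torch.relu(torch.bmm(h, w2) + b2)      # (n_models, n_samples, 10)
    return torch.bmm(h, w3) + b3               # (n_models, n_samples, 1)

for epoch in range(10):
    optimizer.zero_grad()
    pred = forward(x)
    per_model_loss = ((pred - y) ** 2).mean(dim=(1, 2))  # one MSE per model
    per_model_loss.sum().backward()            # sum keeps the models' gradients separate
    optimizer.step()

losses = per_model_loss.detach().tolist()      # plain floats, one per model
```

With this layout a single forward and backward pass covers all regressors, so the GPU sees a few large kernels instead of many small ones.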