Truly parallel ensembles

Is it possible to concurrently pass data through an ensemble of neural networks (the ensemble contains num_models networks)? To give an idea of my current training pipeline, I do the following (a rough code sketch follows the list):

  1. Sample a batch of size n from the dataloader
  2. Pass that batch through one of the networks, returning a loss
  3. Add the loss to the total_loss so far
  4. Repeat from 1. with a different network until all the models have been iterated through once, then backprop total_loss to update all the models. Reset total_loss to 0 and start the outer loop once again.
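
In code, my current loop looks roughly like this (simplified, untested sketch; models, dataloader, criterion and optimizer are placeholder names for my actual objects — models is a plain Python list of num_models networks and the optimizer covers the parameters of every model):

 data_iter = iter(dataloader)
 optimizer.zero_grad()
 total_loss = 0.0
 for model in models:                # iterate over the num_models networks
     x, y = next(data_iter)          # 1. sample a batch of size n
     loss = criterion(model(x), y)   # 2. forward pass through one network
     total_loss = total_loss + loss  # 3. accumulate the loss
 total_loss.backward()               # 4. backprop total_loss through every model
 optimizer.step()                    #    then reset and start the outer loop again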

Similarly, there is a use case where I may wish to pass the same batch of data through all neural networks simultaneously.

I imagine an approach where I sample a single batch of size n * num_models and pass it through all num_models networks in the ensemble simultaneously, which would be much faster, but I’m unsure how to do this. My instinct is to have a wrapper that runs .chunk or .split on the batch (sketched below), but then I’ll still be running a for loop over the models (imagine they exist in a list, for example) and summing their losses, so we’re back to square one. Having said this, will PyTorch’s asynchronous execution actually parallelise this if I write it in a forward method?
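
For reference, the wrapper I have in mind would look something like this (untested sketch; the Python loop over the models is exactly the part I would like to avoid):

 import torch

 class EnsembleWrapper(torch.nn.Module):
     def __init__(self, models):
         super().__init__()
         self.models = torch.nn.ModuleList(models)

     def forward(self, big_batch):
         # big_batch has shape (n * num_models, ...); give each model its own chunk
         chunks = big_batch.chunk(len(self.models), dim=0)
         outputs = [m(c) for m, c in zip(self.models, chunks)]  # still a sequential loop
         return torch.stack(outputs)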

Thanks!

If you have multiple devices, you can use a for loop and call each model separately with the corresponding input.
Since CUDA operations are executed asynchronously, all devices will be used at the same time.
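
As a rough sketch of the multi-device case (assuming the models are stored in a list with one GPU per model; x, y, and criterion are placeholder names for your batch and loss):

 # place each ensemble member on its own GPU
 models = [model.to(f"cuda:{i}") for i, model in enumerate(models)]

 losses = []
 for i, model in enumerate(models):
     x_i = x.to(f"cuda:{i}")                    # copy the batch to this model's device
     y_i = y.to(f"cuda:{i}")
     losses.append(criterion(model(x_i), y_i))  # kernels are launched asynchronously

 # summing on a single device gathers the per-GPU results
 total_loss = sum(loss.to("cuda:0") for loss in losses)
 total_loss.backward()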

However, if you are dealing with a single device, note that each CUDA call will be added to a queue, and the device will most likely already be kept busy by a single model’s kernels, so you won’t see much overlap.

Great, thanks for clearing that up. In that case, would it be possible to design a layer class that is in fact multiple layers and then, using something like an einsum, ensure that a data point is passed through each layer simultaneously?
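
Something like this is what I have in mind (untested sketch of a "batched" linear layer that stores the weights of all ensemble members in a single tensor):

 import torch

 class EnsembleLinear(torch.nn.Module):
     """num_models independent linear layers evaluated in one einsum call."""
     def __init__(self, num_models, in_features, out_features):
         super().__init__()
         self.weight = torch.nn.Parameter(
             torch.randn(num_models, out_features, in_features) * in_features ** -0.5
         )
         self.bias = torch.nn.Parameter(torch.zeros(num_models, out_features))

     def forward(self, x):
         # x: (num_models, batch, in_features) -> (num_models, batch, out_features)
         return torch.einsum("moi,mbi->mbo", self.weight, x) + self.bias.unsqueeze(1)

To feed the same batch to every member I would first expand it, e.g. x.unsqueeze(0).expand(num_models, -1, -1).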

Thanks,
Phil

If I have the same data and a for loop running over the ensemble, I would send the data and the model to one of the GPUs and run the for loop sequentially. How can I parallelize the forward pass and gradient calculation over the ensemble members? And how do I specify the device in the case of 8 GPUs, for instance?

For instance, how to change this pseudocode for that purpose?

 device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
 all_loss = 0
 for i in range(num_ensembles):
     net = nets[i].to(device)   # every ensemble member ends up on the same device,
     input = input.to(device)   # so the forward passes run sequentially on cuda:0
     loss = f(net(input))
     all_loss += loss

You could use CUDA streams for each execution, but note my previous warning: if your GPU is already busy with a single kernel from one model (e.g. because all compute resources are used), you might not see any speedup. Also, you have to make sure your CPU is able to run ahead and is not the bottleneck in scheduling the workload.
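
A rough sketch of that approach (assuming all models live on the same device and models, input, and f are defined as in the previous posts; proper synchronization is needed before the losses are combined):

 streams = [torch.cuda.Stream() for _ in models]
 torch.cuda.synchronize()              # make sure prior work (e.g. the copy of input) is done
 losses = []
 for model, stream in zip(models, streams):
     with torch.cuda.stream(stream):   # launch this model's kernels on its own stream
         losses.append(f(model(input)))
 for stream in streams:
     stream.synchronize()              # wait for every stream before using the results
 all_loss = sum(losses)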

After reading the above comments, I’m still a bit unclear. Let’s assume that each model in the ensemble has been placed on a unique GPU. If I slightly modify the above example:

 all_loss = 0
 for model_idx in range(num_ensembles):
     x = input.to(f"cuda:{model_idx}")   # copy the batch to this model's GPU
     loss = f(nets[model_idx](x))        # the model_idx-th model lives on cuda:{model_idx}
     all_loss += loss.to("cuda:0")       # losses must be on one device to be summed

Will this or will this not run the forward passes in parallel? How can I check?