Use joblib to train an ensemble of small models on the same GPU in parallel

My use case is to train multiple small models to form a parallel ensemble (for example, a bagging ensemble, whose members can be trained in parallel). Example code can be found in the TorchEnsemble library (which is part of the PyTorch ecosystem).

This example code uses the joblib library to train multiple small models in parallel on the same GPU. The core of the parallel training logic is here:

from joblib import Parallel, delayed

# Maintain a pool of workers
with Parallel(n_jobs=self.n_jobs) as parallel:
    # Training loop
    for epoch in range(epochs):
        rets = parallel(delayed(_parallel_fit_per_epoch)(...))
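For context, here is a minimal self-contained illustration of the joblib pattern used above (with a hypothetical toy task instead of actual model training): each delayed call is dispatched to a worker process from the pool.

```python
from joblib import Parallel, delayed

# Hypothetical toy task standing in for _parallel_fit_per_epoch
def square(x):
    return x * x

# Maintain a pool of 2 workers; dispatch 10 tasks across them
with Parallel(n_jobs=2) as parallel:
    results = parallel(delayed(square)(i) for i in range(10))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

For CPU-bound tasks like this, the two workers genuinely run concurrently; the question below is why the same pattern does not help when each task uses the same GPU.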

where the _parallel_fit_per_epoch function is responsible for training each base model for a single epoch. Its core logic looks like this:

def _parallel_fit_per_epoch(...):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

Suppose we want to train 10 base models to form our ensemble. It seems to me that setting n_jobs > 1 does not provide any training speedup. If I set n_jobs=2 to train 10 base models, I would expect each job to train 5 base models with the 2 jobs running in parallel, so in theory this should take half the training time of n_jobs=1, i.e. a 2x speedup. In practice, however, it not only gave no speedup but actually slowed training down: n_jobs=2 takes slightly longer than n_jobs=1 to train the 10 base models. Why does this happen?

My hypothesis is that when joblib is combined with PyTorch, what actually happens is time sharing (as suggested by this Quora answer): at any given time, only a single job is using the GPU (CUDA). By design, time sharing does not reduce the total training time, and it adds context-switch overhead, which would explain why the total training time with joblib multiprocessing is even longer for GPU model training. Is this correct?

In any case, is there a way to divide the cores of a single GPU into multiple groups, and divide the GPU memory accordingly, to train a parallel ensemble of small models with PyTorch? Is this even possible?

P.S. I do not know how to write CUDA code and I am not an expert in multiprocessing.

Since we assume each base model in the ensemble is small, we also assume the memory of this single GPU is sufficient to host all of the models at the same time (for the forward and backward passes).

That’s not necessarily the case: you would need free compute resources on your GPU and would also need to use different streams, while still leaving enough resources for parallel execution. If a single stream already saturates the compute resources of your GPU, the stream launches will be serialized, as they cannot run in parallel.
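For concreteness, here is a minimal sketch (with hypothetical model and helper names, not torchensemble's API) of launching the training steps of two tiny models on separate CUDA streams. Whether the kernels actually overlap depends on the GPU having free SMs; on a CPU-only machine the sketch falls back to a plain sequential loop.

```python
import torch
import torch.nn as nn

# Hypothetical helper: run one training step inside a given CUDA stream.
# Kernels issued in different streams MAY execute concurrently if the GPU
# has spare compute resources; otherwise they are serialized.
def train_step_on_stream(model, criterion, optimizer, data, target, stream):
    with torch.cuda.stream(stream):
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()

device = "cuda" if torch.cuda.is_available() else "cpu"
models = [nn.Linear(16, 2).to(device) for _ in range(2)]
opts = [torch.optim.SGD(m.parameters(), lr=0.1) for m in models]
criterion = nn.CrossEntropyLoss()
data = torch.randn(8, 16, device=device)
target = torch.randint(0, 2, (8,), device=device)

if device == "cuda":
    streams = [torch.cuda.Stream() for _ in models]
    for m, o, s in zip(models, opts, streams):
        train_step_on_stream(m, criterion, o, data, target, s)
    torch.cuda.synchronize()  # wait for all streams before reading results
else:
    # CPU fallback: streams do not apply, so just run sequentially
    for m, o in zip(models, opts):
        o.zero_grad()
        criterion(m(data), target).backward()
        o.step()
```

Note this is a sketch, not a drop-in solution: real multi-stream training also has to be careful about memory reuse across streams, and a profiler (e.g. Nsight Systems) is the only way to confirm the kernels actually overlap.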

Thanks a lot for your answer. Do I need to write the CUDA stream code myself to train several tiny/small models in parallel on a single GPU, or does the joblib library combined with PyTorch automatically handle this under the hood?

This Quora answer says there are two possible mechanisms for splitting one GPU across several tasks running at the same time: dividing the cores of the GPU into multiple groups and dividing the GPU memory accordingly, or time sharing. Which of these two mechanisms is happening when I write the CUDA stream code?

“Time sharing” is not a programming approach; the user describes it as a way for e.g. cloud providers to swap between jobs of different users. PyTorch of course does not implement such behavior on your workstation.

Divide the cores of GPU into multiple groups and divide the GPU memory accordingly.

I don’t fully understand this claim, but I guess the user is pointing out that each kernel should not occupy the entire GPU’s resources, so that concurrent kernels can be executed.

Take a look at this topic and the linked GTC talk, which give a good overview of how the GPU works.