My use case is to train multiple small models to form a parallel ensemble (for example, a bagging ensemble, whose members can be trained in parallel). An example can be found in the TorchEnsemble library (part of the PyTorch ecosystem).
This example code uses the joblib library to train multiple small models in parallel on the same GPU. The core of the parallel training logic is here:
from joblib import Parallel, delayed

# Maintain a pool of workers
with Parallel(n_jobs=self.n_jobs) as parallel:
    # Training loop
    for epoch in range(epochs):
        rets = parallel(delayed(_parallel_fit_per_epoch)(...))
where the _parallel_fit_per_epoch function trains a single base model for one epoch; its core logic looks like this:
def _parallel_fit_per_epoch(...):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
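For reference, here is a minimal self-contained sketch of the pattern I am using (the model, data and hyperparameters are placeholders I made up for illustration, not the actual TorchEnsemble code; it assumes a single CUDA GPU is available):

import torch
import torch.nn as nn
from joblib import Parallel, delayed

def make_model():
    # Placeholder "small" base model (the real base models come from TorchEnsemble).
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

def _parallel_fit_per_epoch(model, data, target):
    # Each job moves its own copy of the model and the data onto the shared GPU
    # and runs one optimization step (standing in for the full train_loader loop).
    device = torch.device("cuda")
    model = model.to(device)
    data, target = data.to(device), target.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
    return model.cpu(), loss.item()

if __name__ == "__main__":
    n_estimators, epochs = 10, 5
    models = [make_model() for _ in range(n_estimators)]
    data = torch.randn(256, 32)
    target = torch.randint(0, 2, (256,))

    with Parallel(n_jobs=2) as parallel:
        for epoch in range(epochs):
            rets = parallel(
                delayed(_parallel_fit_per_epoch)(m, data, target) for m in models
            )
            # Collect the updated models back from the worker processes.
            models = [m for m, _ in rets]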
Suppose we want to train 10 base models to form our ensemble. It seems to me that setting n_jobs > 1 does not make training any faster. If I set n_jobs=2 to train the 10 base models, I would expect each job to train 5 of them, with the 2 jobs running in parallel; in theory this should take about half the training time of n_jobs=1, i.e. a 2x speedup. In practice, however, it gave no speedup at all and actually slowed things down: n_jobs=2 takes slightly longer than n_jobs=1 to train the 10 base models. Why does this happen?
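For what it is worth, this is roughly how I compared the two settings (reusing the placeholder names from the sketch above; hypothetical benchmarking code, not the exact script I ran):

import time
import torch

def time_training(n_jobs, n_estimators=10, epochs=5):
    # Wall-clock time of the joblib training loop above for a given n_jobs.
    models = [make_model() for _ in range(n_estimators)]
    data = torch.randn(256, 32)
    target = torch.randint(0, 2, (256,))
    start = time.perf_counter()
    with Parallel(n_jobs=n_jobs) as parallel:
        for epoch in range(epochs):
            rets = parallel(
                delayed(_parallel_fit_per_epoch)(m, data, target) for m in models
            )
            models = [m for m, _ in rets]
    return time.perf_counter() - start

# Expected: time_training(2) ~ 0.5 * time_training(1); observed: it is slightly slower.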
My hypothesis is that when joblib is combined with PyTorch, what actually happens is time sharing (as suggested by this Quora answer): at any given time, only a single job is using the GPU (CUDA). By design, time sharing does not reduce the total training time, and it adds context-switch overhead, which would explain why the total training time with joblib multiprocessing is even longer for GPU training. Is this correct?
In any case, is there a way to split the cores of a single GPU into multiple groups, and partition the GPU memory accordingly, so that a parallel ensemble of small models can be trained with PyTorch? Is this even possible?
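To make the question concrete, the closest thing I have found for the memory half is torch.cuda.set_per_process_memory_fraction, which caps how much GPU memory a single process may allocate; I do not know whether anything analogous exists for partitioning the GPU's compute units from within PyTorch. This is only a guess at what a solution might involve, not something I have verified to help:

import torch

# Hypothetically run in each worker process: allow this process to allocate at most
# ~1/10 of the GPU's memory, so 10 workers could coexist on the one device.
torch.cuda.set_per_process_memory_fraction(0.1, device=0)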
P.S. I do not know how to write CUDA code and I am not an expert in multiprocessing.
Since each base model in the ensemble is assumed to be small, we can also assume that the memory of this single GPU is sufficient to hold all of these models at the same time (for the forward pass + backward pass).
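As a rough sanity check on that assumption, the parameter memory of the ensemble can be estimated like this (placeholder model again; activations, gradients and optimizer state add a further multiple on top):

import torch.nn as nn

def param_memory_mb(model):
    # Memory taken by the parameters alone, in MB.
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

small_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))  # placeholder
print(f"~{10 * param_memory_mb(small_model):.2f} MB of parameters for 10 base models")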