Train multiple models simultaneously on a single GPU

I want to train an ensemble of NNs on a single GPU in parallel.
Currently I’m doing this:

for model in models:
  model.to('cuda')
  train_model(model, ...)

Each model is quite small but the GPU utilisation is tiny (3%), which makes me think that the training is happening serially. In fact, if I increase the number of members of the ensemble the training time increases proportionally. This confuses me, as CUDA calls should be asynchronous. Is there any reason why this shouldn’t run in parallel?

I’ve tried to go a step further by taking inspiration from torchensemble, which uses joblib (see code below), but I am not seeing any speedup whatsoever.

def _parallel_fit_per_epoch(
    train_loader,
    estimator,
    optimizer,
    criterion,
    device,
):
    for batch_idx, elem in enumerate(train_loader):
        data, target = elem[0].to(device), elem[1].to(device)
        optimizer.zero_grad()
        output = estimator.forward(data)
        loss = criterion(output, target.unsqueeze(-1).float())
        loss.backward()
        optimizer.step()
    return estimator, optimizer, loss

class Regressor(nn.Module):
    # etc
class Ensemble(nn.Module):

    def __init__(self, num_models, ...):
        super().__init__()
        self.num_models = num_models
        self.models = [Regressor(...) for _ in range(self.num_models)]
        # etc

    def fit(self, train_dataset, valid_dataset, epochs, batch_size, learning_rate,
            weight_decay, patience, save_model=False, save_dir=None):
        optimizers = []
        train_loaders = []
        for model in self.models:
            opt = torch.optim.Adam(model.parameters(), lr=learning_rate)
            optimizers.append(opt)
            train_loaders.append(DataLoader(train_dataset, batch_size=batch_size, shuffle=True))

        with Parallel(n_jobs=self.num_models) as parallel:
            # Training loop
            for epoch in range(epochs):
                rets = parallel(
                    delayed(_parallel_fit_per_epoch)(
                        dataloader,
                        estimator,
                        optimizer,
                        self.loss,
                        self.device,
                    )
                    for idx, (estimator, optimizer, dataloader) in enumerate(
                        zip(self.models, optimizers, train_loaders)
                    )
                )

                estimators, optimizers, losses = [], [], []
                for estimator, optimizer, loss in rets:
                    estimators.append(estimator)
                    optimizers.append(optimizer)
                    losses.append(loss)

I’m running all this on a slurm cluster, using:

#SBATCH --nodes=1
#SBATCH --mem=120G
#SBATCH --gres=gpu:1

Are there any other directives I should be using?

CUDA calls are indeed asynchronous. However, your CPU must be fast enough at scheduling the kernels for parallel execution to even be possible.
A low GPU utilization often indicates that your use case is CPU-limited and that the GPU is waiting for new work between short periods of actual execution.
You could check this with a visual profiler such as Nsight Systems, which would show a lot of whitespace between the kernels in the timeline.
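For a quick first look from Python, torch.profiler can expose the same gaps; a minimal sketch, assuming a model, loader, optimizer and criterion already exist and that loader yields (data, target) batches:

import torch
from torch.profiler import profile, ProfilerActivity

# Profile a handful of training iterations and export a timeline trace.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for i, (data, target) in enumerate(loader):
        data, target = data.to('cuda'), target.to('cuda')
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        if i == 10:  # a few iterations are enough for a timeline
            break

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")  # open in chrome://tracing and look for gaps between kernels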

Besides that, your GPU would also need free compute resources to be able to run multiple kernels in parallel.
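For illustration, kernel-level concurrency on a single device is usually expressed with CUDA streams; a minimal sketch (models and batch are assumed to exist, and actual overlap still depends on free compute resources and on how quickly the CPU can launch the kernels):

import torch

# Issue the forward pass of each model on its own CUDA stream.
streams = [torch.cuda.Stream() for _ in models]
outputs = []
for model, stream in zip(models, streams):
    with torch.cuda.stream(stream):
        outputs.append(model(batch.to('cuda', non_blocking=True)))
torch.cuda.synchronize()  # wait for all streams before using the outputs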

Thanks @ptrblck for the fast reply! I have a question though: if CUDA calls work asynchronously and my first piece of code should work in parallel (with the caveats you mentioned) then could you explain to me what is the point of using joblib in torchensemble, or vmap in functorch (and JAX)?

I’m not familiar with torchensemble, but would guess the authors tried to use multiprocessing via joblib to avoid the aforementioned CPU bottlenecks.

vmap tries to automatically batch workloads instead of using a loop over a specific op.
The torch.vmap recipe shows an example for torch.dot, which does not accept batched inputs, as well as other examples.
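For reference, the torch.dot case from the recipe looks roughly like this (a small self-contained sketch):

import torch

# torch.dot only accepts 1-D tensors, so a batch normally needs a Python loop:
x, y = torch.randn(64, 5), torch.randn(64, 5)
looped = torch.stack([torch.dot(a, b) for a, b in zip(x, y)])

# vmap adds the batch dimension automatically instead of looping over the op:
batched_dot = torch.vmap(torch.dot)
vmapped = batched_dot(x, y)

assert torch.allclose(looped, vmapped)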

Thanks for that. Would these bottlenecks come from anything other than a custom Dataset and/or training loop? My custom Dataset only performs indexing, and my training loop is just the standard Torch loop (see below). I’ve run this on Colab and I keep getting the same performance as on the cluster, so it’s not related to the slurm directives. For reference, each training epoch takes about 0.1s, so about 10s for 100 epochs of training. Isn’t that enough time for the CPU to schedule the kernels?

class MyDataset(Dataset):
    def __init__(self, X, y, z, indices):
        self.X = X
        self.y = y
        self.z = z
        self.indices = indices

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx], self.z[idx], self.indices[idx]

class Regressor(nn.Module):
    # various defs..
    def train_step(self, dataloader, optimizer, epoch=None):
        for _, elem in enumerate(dataloader):
            data, target = elem[0].to('cuda'), elem[1].to('cuda')
            optimizer.zero_grad()
            output = self(data)
            loss = self.criterion(output, target)  # self.criterion defined elsewhere
            loss.backward()
            optimizer.step()
        return loss

class Ensemble(nn.Module):
    def __init__(self, num_models, ...):
        super().__init__()
        self.num_models = num_models
        self.models = [Regressor(...).to('cuda') for _ in range(self.num_models)]

X, Y, Z, ID = get_data()
train_dataset = MyDataset(X, Y, Z, ID)
train_loader = DataLoader(train_dataset, shuffle=True, pin_memory=True, batch_size=256)
ensemble = Ensemble()
for model in ensemble.models:
    model.train()
    for epoch in range(epochs):
        train_loss_ = model.train_step(train_loader, opt, epoch=epoch)

Just to make sure, in code snippet 1 are you moving estimator to the GPU?

In code snippet 2, I believe you’re constructing the optimizer in Regressor first and then moving the model to CUDA. You must register the params with the optimizer after you move the model to CUDA (ref: "Parameters of a model after .cuda() will be different objects with those before the call." is wrong. · Issue #7844 · pytorch/pytorch · GitHub)
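In other words, the suggested ordering is roughly (a minimal sketch, with nn.Linear standing in for Regressor):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # stand-in for Regressor(...)
model = model.to('cuda')   # move the model to the GPU first...
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # ...then register its parameters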

By the way, it’s recommended to use ModuleList and a single optimizer (unless you want different optimizer algos on subsets of the ensemble…). This makes saving to disk easy; you only need to save ensemble and optimizer instead of saving all the different objects in your lists. See this quick example for reference:

class Ensemble(nn.Module):
    def __init__(self, num_models, ...):
        super().__init__()
        self.num_models = num_models
        self.model_list = nn.ModuleList([Regressor(...) for _ in range(self.num_models)])
        # etc

    def forward(self, data, target):
        loss = 0
        for mod in self.model_list:
            loss += mod(data, target, self.criterion)  # calculate the loss in the regressor's forward
        return loss

def fit(ensemble, ...):
    ensemble.to(device)
    # register a single optimizer (with different param_groups for each regressor if necessary)
    # instead of constructing multiple optimizers
    optimizer = Adam(ensemble.parameters(), lr=lr)
    for data, target in loader:
        loss = ensemble(data, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Thanks @suraj.pt! Yes I was moving estimator to GPU, sorry!

I didn’t know about this! However, the behaviour does not seem to change if I register the optimiser before calling .cuda(). Is this expected?

Thanks also for the example code, I hadn’t considered that option. But isn’t this going to behave very similarly to the naive parallelisation I was doing to start with (see first post)?

Data loading and processing could certainly be responsible for a large CPU overhead, but note that it would depend on the actual CPU vs. GPU workload.
Even just kernel launches could show a visible CPU overhead if the actual GPU workload is tiny.
Think about a theoretical workload where each kernel launch takes 1ms while the GPU workload takes 1us.
Even if the GPU could run these kernels in parallel, your CPU would never be able to schedule them quickly enough, which is why I recommended profiling your workload with a visual profiler.
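If you want a rough number for this before opening Nsight, comparing an unsynchronized loop (launch cost only) against a synchronized one (launch plus GPU execution) gives a hint; a small sketch with made-up shapes:

import time
import torch

x = torch.randn(256, 32, device='cuda')
w = torch.randn(32, 32, device='cuda')
x @ w
torch.cuda.synchronize()  # warmup

# No synchronization: this mostly measures how fast the CPU can queue kernels.
t0 = time.perf_counter()
for _ in range(1000):
    x @ w
launch_only = time.perf_counter() - t0

torch.cuda.synchronize()
# Synchronize at the end: this includes the GPU execution time as well.
t0 = time.perf_counter()
for _ in range(1000):
    x @ w
torch.cuda.synchronize()
total = time.perf_counter() - t0

print(f"launch-only: {launch_only:.4f}s, launch+GPU: {total:.4f}s")

If the two numbers are almost identical, the workload is launch-bound rather than compute-bound.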

I see, that would explain also why my GPU memory usage is so tiny. I’ll give a go with the profiler then.

If CPU overheads turn out to be an issue, what options do I have? It still takes only about 2 minutes to train my ensemble (some ~25s per NN), but I have to do it for 10 of these ensembles, each regressing to different variables. So that’s 20 minutes on a single GPU. I have to do Active Learning with this, and need to retrain about 300 times. This is very expensive without using a separate GPU for each ensemble. And even then, if the GPU isn’t doing much, it’s a lot of wasted GPU time…

If the kernel launches are indeed visible, you could use CUDA Graphs as described here and here. Note that this workflow has a few limitations, such as requiring static input shapes, but once the graph is recorded the replay would be cheap, as you would launch the whole graph once instead of each tiny kernel separately.
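A minimal capture/replay sketch along the lines of the linked docs (SGD and nn.Linear stand in for the real optimizer and model; loader is assumed to yield fixed-size batches matching the captured shapes):

import torch
import torch.nn as nn

model = nn.Linear(32, 1).to('cuda')   # stand-in for one small ensemble member
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# CUDA Graphs need static shapes and addresses, so use fixed buffers.
static_x = torch.randn(256, 32, device='cuda')
static_y = torch.randn(256, 1, device='cuda')

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        criterion(model(static_x), static_y).backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training step into a graph.
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = criterion(model(static_x), static_y)
    static_loss.backward()
    optimizer.step()

# Replay: copy new data into the static buffers and launch the whole step
# with a single call instead of many tiny kernel launches.
for data, target in loader:
    static_x.copy_(data)
    static_y.copy_(target)
    g.replay()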