Ensemble learning: Parallelize models or computations?

Hi all,

I am working on an ensemble of deep CNNs with 4 GPUs.

I would like to know whether it is better/faster to train each model parallelized across the 4 devices, one model after another, or to train each model (suppose ensemble size = 4) on its own GPU.

Note: All the individual networks are exposed to the entire dataset.

My current code is:

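import torch
import torch.nn as nn
import torch.optim as optim

cuda = torch.cuda.is_available()
device = 'cuda' if cuda else 'cpu'
gpus = torch.cuda.device_count() > 1  # torch.cuda.device_count() = 4

# ResNet, ensemble_size and learning_rate are assumed to be defined elsewhere
ensemble = []
optimizers = []
for i in range(ensemble_size):
    model = ResNet()
    optimizers.append(optim.SGD(model.parameters(), lr=learning_rate))
    if gpus:
        # Replicate the model on all visible GPUs and split each batch among them
        model = nn.DataParallel(model)
    ensemble.append(model)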
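
For context on the batch question: per the PyTorch documentation, nn.DataParallel splits each input batch across the devices by chunking along the batch dimension, so each replica processes only part of every batch while the model as a whole still goes through the entire dataset. The sketch below only emulates that split in plain Python, assuming it follows torch.chunk semantics; data_parallel_chunk_sizes is a hypothetical helper for illustration, not a PyTorch function.

```python
import math


def data_parallel_chunk_sizes(batch_size, n_devices):
    """Sizes of the per-device pieces when a batch is chunked along dim 0.

    Assumption: the split mirrors torch.chunk, i.e. every piece has
    ceil(batch_size / n_devices) samples except possibly the last one,
    and fewer pieces than devices may be produced for small batches.
    """
    if batch_size <= 0 or n_devices <= 0:
        raise ValueError("batch_size and n_devices must be positive")
    chunk = math.ceil(batch_size / n_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        piece = min(chunk, remaining)
        sizes.append(piece)
        remaining -= piece
    return sizes


print(data_parallel_chunk_sizes(128, 4))
print(data_parallel_chunk_sizes(10, 4))
```

With a batch of 128 on 4 GPUs, each replica would handle 32 samples per step under this assumption.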

If it is better to put each model on its own GPU, would this be correct?

    if gpus:
        with torch.cuda.device(i):
            model.to(device)
            # model = nn.DataParallel(model)  # I can't parallelize the batches now, right?
  • With model.to(device), am I parallelizing each model across the 4 GPUs of the device?
  • I cannot use nn.DataParallel(model) to distribute the batches across the GPUs, since each GPU has to run the entire dataset for each model, right?
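
To make the "one model per GPU" option concrete, here is a minimal sketch of the device assignment only, assuming ensemble member i goes to GPU i (cycling if there are more models than GPUs). assign_devices is a hypothetical helper, not part of PyTorch, and it deliberately avoids importing torch so it runs anywhere.

```python
def assign_devices(ensemble_size, n_gpus):
    """Return one device string per ensemble member.

    Member i is mapped to "cuda:<i mod n_gpus>"; with no GPUs available,
    every member falls back to "cpu".
    """
    if ensemble_size < 0 or n_gpus < 0:
        raise ValueError("ensemble_size and n_gpus must be non-negative")
    if n_gpus == 0:
        return ["cpu"] * ensemble_size
    return [f"cuda:{i % n_gpus}" for i in range(ensemble_size)]


devices = assign_devices(4, 4)
print(devices)
```

Each model would then be moved with model.to(devices[i]), and the batches fed to that model would be moved to the same device string.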

Thanks a lot and sorry for the long post!