Efficient ensemble training on multiple GPUs

Hey,

I am trying to train an ensemble of models on a dataset. At the moment I am using a wrapper module that wraps all members of the ensemble into a single nn.Module:

import torch.nn as nn


class Ensemble(nn.Module):

    def __init__(self, models):
        super().__init__()
        # nn.ModuleList registers the sub-models, so their parameters show up
        # in .parameters() and move along with .to() / .cuda().
        self.models = nn.ModuleList(models)

    def forward(self, x):
        # Run every ensemble member on the same input and collect the outputs.
        y = []
        for model in self.models:
            y.append(model(x))
        return y

This works, but I don't think it is very efficient: the for-loop does not parallelize well, and the backward pass still needs a lot of memory because all outputs are collected into a single loss. What would be the best way to make training this ensemble more efficient?
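One idea I came across is to stack the parameters of all members and batch the per-model forward passes with torch.vmap. As far as I understand this only works if every member has the same architecture, and I am not sure it helps with the memory issue. A rough, untested sketch (MLP, num_models and x are placeholders for my actual setup):

import copy
import torch
from torch.func import stack_module_state, functional_call

models = [MLP().cuda() for _ in range(num_models)]  # placeholder: identical architectures

# Stack the parameters and buffers of all members into batched tensors.
params, buffers = stack_module_state(models)

# A "stateless" template copy on the meta device, used only for its structure.
base_model = copy.deepcopy(models[0]).to("meta")

def call_single_model(p, b, x):
    return functional_call(base_model, (p, b), (x,))

# vmap over the stacked parameter dimension; the same input x goes to every member.
predictions = torch.vmap(call_single_model, in_dims=(0, 0, None))(params, buffers, x)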
I was thinking of separating the models during training and putting them on different GPUs, but I do not have enough GPUs to give every model its own GPU…
Is it somehow possible to loop over the dataset and over the models in order to distribute the training efficiently over the available GPUs (I sketched the rough idea below)? Is there maybe a totally different approach?
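This is roughly the round-robin placement I had in mind, as an untested sketch; criterion and optimizers stand in for my actual loss function and per-model optimizers:

import torch

def assign_round_robin(models):
    # Place each ensemble member on one of the available GPUs, cycling through them.
    n_gpus = torch.cuda.device_count()
    devices = [torch.device(f"cuda:{i % n_gpus}") for i in range(len(models))]
    for model, device in zip(models, devices):
        model.to(device)
    return devices

def ensemble_step(models, devices, optimizers, criterion, x, y):
    # One training step: each member computes its own loss on its own GPU,
    # so the graphs stay separate and CUDA work on different GPUs can overlap.
    losses = []
    for model, device, opt in zip(models, devices, optimizers):
        opt.zero_grad()
        out = model(x.to(device, non_blocking=True))
        loss = criterion(out, y.to(device, non_blocking=True))
        loss.backward()
        opt.step()
        losses.append(loss.detach().cpu())
    return losses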

Thanks for helping!

Did you find a way to make it efficient?