Faster for-loop for ensemble

I am working with ensembles of neural networks, roughly of the following form:

import torch
from torch import nn

class NNEnsemble(nn.Module):
    def __init__(self, n_estimators):
        super().__init__()
        self.n_estimators = n_estimators
        self.estimators = nn.ModuleList([generate_model() for _ in range(n_estimators)])

    def forward(self, x):
        out = torch.stack([est(x) for est in self.estimators], dim=1)
        return torch.sum(out, dim=1)

Here, generate_model() generates a small base model (e.g. a 3-layer MLP with 128 hidden neurons). As it turns out, the list comprehension in forward is rather slow: I only get < 10% GPU utilization for an ensemble of 100 small networks. Is there any way to speed up this operation, e.g. with some kind of "parallel" list comprehension?

I found one post about grouped convolution which mentions a similar structure, but there was no real solution besides waiting for cudnn support.


Unfortunately, if you do many small operations, it's going to be fairly slow. The overhead of launching each kernel on the GPU can be larger than the time you spend actually computing.
You can try running the small parts of your model on the CPU, increasing the batch size to give your GPU more work per launch, or finding a way to aggregate your estimators into a single module that performs one big operation.
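As an illustration of the "aggregate into one big operation" idea: if all estimators share the same architecture, you can stack their weights along a leading estimator dimension and replace N small matmuls with one batched matmul. This is only a sketch for a single linear layer (the shapes and names below are made up, not from your code), but the same trick extends layer by layer to a whole MLP:

```python
import torch

# Hypothetical shapes: 100 estimators, batch of 32, 128 -> 128 linear layers.
n_estimators, batch, in_features, out_features = 100, 32, 128, 128

# One weight/bias per estimator, stacked along a leading "estimator" dim.
W = torch.randn(n_estimators, in_features, out_features)
b = torch.randn(n_estimators, 1, out_features)

x = torch.randn(batch, in_features)

# matmul broadcasts x against the estimator dim: one big batched matmul
# instead of 100 small ones.
out = torch.matmul(x, W) + b     # shape: (n_estimators, batch, out_features)
summed = out.sum(dim=0)          # ensemble sum, shape: (batch, out_features)

# Equivalent (slow) per-estimator loop, for comparison:
ref = torch.stack([x @ W[i] + b[i, 0] for i in range(n_estimators)], dim=0).sum(dim=0)
```

Since this turns 100 kernel launches into one, the GPU stays busy even though each individual estimator is tiny.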