Ensemble 1000 small models into one efficiently

Hi, I'm trying to create a model that is an ensemble of more than 1000 small models. Each small model takes the same input vector but processes it with a different mask (the input vectors are very sparse).

import torch
import torch.nn as nn

class MiniModel(nn.Module):
    def __init__(self, clusters_set):
        super(MiniModel, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(len(clusters_set), 1),
        )
        # Fixed binary mask, stored as a non-trainable parameter.
        self.mask = nn.Parameter(self.get_cluster_mask(clusters_set), requires_grad=False)

    def apply_mask(self, x):
        # Broadcast the mask over the batch, then keep only the selected features
        # (masked_select expects a boolean mask).
        mask = self.mask.expand(x.shape[0], self.mask.shape[-1])
        return torch.masked_select(x, mask).view(x.shape[0], -1)
    def forward(self, x):
        x = self.apply_mask(x)
        x = self.fc(x)
        return x

The mask is a binary tensor that is different for every MiniModel.
Then there is a model that creates all the MiniModels and ensembles them together:

class Tree(nn.Module):
    def __init__(self, settings_dict):
        super(Tree, self).__init__()
        self.tree = {}
        for category, clusters in settings_dict.items():
            self.tree[str(category)] = MiniModel(clusters)
        self.tree_nn = nn.ModuleList(self.tree.values())
    def forward(self, x):
        return torch.cat([model(x.clone()) for model in self.tree_nn], dim=1)

Question: The problem is that training is very inefficient: because there are so many models, each step loops over all of them and runs very slowly. Is there a way to run the forward pass through all of the MiniModels in Tree simultaneously, as one big vectorized multiplication?
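
For reference, here is a minimal sketch of what such a vectorized forward pass could look like. It assumes each MiniModel stays a single nn.Linear(k, 1) and that the masks can be stacked into one (n_models, n_features) binary tensor. The class name MaskedEnsemble and the argument masks are hypothetical, not part of the code above. The idea is that selecting features with a mask and then applying a linear layer gives the same result as multiplying the full input by a weight row that is zero outside the mask.

```python
import torch
import torch.nn as nn


class MaskedEnsemble(nn.Module):
    """Hypothetical stand-in for Tree: all MiniModels as one masked linear layer."""

    def __init__(self, masks):
        super().__init__()
        # masks: (n_models, n_features) binary tensor, one row per MiniModel.
        # Stored as a buffer so it moves with the module but is never trained.
        self.register_buffer("masks", masks.float())
        n_models, n_features = masks.shape
        # One weight row and one bias per MiniModel (initialization is an assumption).
        self.weight = nn.Parameter(torch.randn(n_models, n_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(n_models))

    def forward(self, x):
        # Zero the weights outside each model's mask, then do a single matmul.
        # Output shape is (batch, n_models), matching torch.cat(..., dim=1) in Tree.
        return x @ (self.weight * self.masks).t() + self.bias
```

Weights outside a mask receive zero gradient, so they never affect the output. The trade-off is a dense (n_models, n_features) weight matrix; if that is too large for very sparse masks, a sparse matrix product might be worth trying instead.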