Efficient way of passing the same input through many models

I want to pass the same input vector through a number of modules. These modules all have the same architecture and simply vary in their weights.
Currently I am doing it like this:

    def forward(self, modules):
        embeddings = []
        for module in modules:
            # run the shared input through each module, one at a time
            embeddings.append(module(self.in_states))
        return torch.vstack(embeddings)

This works fine in principle, but it is very slow since I do not make use of any parallelization.
I have only one machine with a single GPU.
Does somebody know a more efficient way of doing this?

Often the most efficient way to use a GPU is to leverage a single kernel at a time doing a lot of work. If the utilization of your GPU is high with this approach, then there likely isn’t a way to speed things up (trying to parallelize things by running multiple kernels in parallel would likely slow things down with the increased contention). If the utilization is low, you might see if you can increase the amount of data parallelism (e.g., batch size).
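As a minimal sketch of the batch-size advice above (the layer and sizes here are made up, not from the thread): applying one `nn.Linear` to a stacked batch produces the same result as a Python loop, but as a single large kernel instead of many small ones:

```python
import torch

# Illustrative only: batching the same input 64 times and running the layer
# once matches looping the layer 64 times, while launching one matmul kernel.
layer = torch.nn.Linear(16, 16)
x = torch.randn(16)

looped = torch.stack([layer(x) for _ in range(64)])   # 64 small kernels
batched = layer(x.repeat(64, 1))                      # one batched kernel

assert torch.allclose(looped, batched, atol=1e-6)
```

The larger the batch dimension, the better a single kernel can saturate the GPU, which is why raising data parallelism is usually the first thing to try when utilization is low.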


In case anybody ever stumbles upon a similar task: I found a way, which is pretty specific to what I need.
Basically, since the nets I am using are all fully connected sequential networks, I can manually execute the forward passes via batched matrix multiplications, which makes use of CUDA's parallelism again.

Here is a snippet of my code:

    def forward(self, named_parameters):
        # number of modules whose parameters are batched in each dict entry
        batch_size = named_parameters[self.weight_keys[0]].shape[0]
        # repeat the shared input once per module: (batch_size, 1, in_features)
        forward_values = self.in_states.repeat(batch_size, 1, 1)

        for i in range(len(self.weight_keys)):
            # batched x @ W^T: transpose the last two dims of the stacked weights
            linear_transformed = forward_values @ torch.transpose(named_parameters[self.weight_keys[i]], dim0=1, dim1=2)
            forward_values = linear_transformed + named_parameters[self.bias_keys[i]].unsqueeze(1)
            if self.num_activations > i:
                forward_values = self.activations[i](forward_values)

        return forward_values.reshape(batch_size, -1)

named_parameters here are the named parameters of the modules I want to pass the same input (in_states) through. They are in a special format where each dictionary entry holds a whole batch of that specific layer's parameters (here "batch" refers to the number of modules).
self.weight_keys and self.bias_keys contain the keys of the weight and bias parameters in the correct order,
and self.activations the activations of the module.
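For illustration, a batched dict in this format could be built from a list of identical modules roughly like this (`stack_named_parameters` is a hypothetical helper I am sketching here, not part of the original code):

```python
import torch

def stack_named_parameters(modules):
    # Hypothetical helper: stack each parameter across architecturally
    # identical modules, so that e.g. "0.weight" gets shape
    # (num_modules, out_features, in_features).
    return {
        name: torch.stack([dict(m.named_parameters())[name] for m in modules])
        for name, _ in modules[0].named_parameters()
    }

nets = [
    torch.nn.Sequential(torch.nn.Linear(3, 5), torch.nn.Tanh(), torch.nn.Linear(5, 2))
    for _ in range(4)
]
params = stack_named_parameters(nets)
# params["0.weight"] stacks the first Linear's weight across the 4 nets
```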
As said, this is very specific and only works if all models have the same architecture and are fully connected sequential models.
It drastically increased execution speed (when using CUDA) over the implementation I showed in my question.
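A related, less architecture-specific option (available in newer PyTorch versions via `torch.func`, not mentioned above) is to stack the parameters of all models and `vmap` a single functional forward over them:

```python
import copy
import torch
from torch.func import stack_module_state, functional_call

# Five architecturally identical nets with different weights.
models = [
    torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
    for _ in range(5)
]

# Stack all parameters/buffers along a new leading "model" dimension.
params, buffers = stack_module_state(models)

# A stateless "meta" copy provides the architecture without its own weights.
base = copy.deepcopy(models[0]).to("meta")

def fmodel(p, b, x):
    return functional_call(base, (p, b), (x,))

x = torch.randn(4)  # the shared input
out = torch.vmap(fmodel, in_dims=(0, 0, None))(params, buffers, x)
# out stacks one output per model along dim 0
```

Unlike the hand-written batched matmul, this works for any architecture, as long as all models share it.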