Trying to create parallel RNN technical proof of concept

This is a technical proof of concept prior to experimenting with an MoE RNN.

Essentially, I want to run two RNNs in parallel on the same input sequence. The caveat is that the number of parallel RNNs is not known in advance, hence the for loop.

import math

import torch
import torch.nn as nn


class pRNN(nn.Module):

    def __init__(self, input_size=0, hidden_size=0, num_layers=1, bidirectional=False, dropout=0, subunit_count=1):
        super().__init__()

        self.subunit_count = subunit_count
        subunit_size = math.ceil(hidden_size / subunit_count)

        # Split the total hidden size across the subunits; the last subunit
        # gets whatever remains if the split is uneven.
        self.rnn = []
        hidden_size_remaining = hidden_size
        for i in range(self.subunit_count):
            self.rnn.append(nn.GRU(input_size=input_size,
                                   hidden_size=min(hidden_size_remaining, subunit_size),
                                   num_layers=num_layers,
                                   bidirectional=bidirectional,
                                   dropout=dropout))
            hidden_size_remaining -= subunit_size

        self.rnn = nn.ModuleList(self.rnn)

    def forward(self, x):
        out = None
        for i in range(self.subunit_count):
            if out is None:
                out, hidden = self.rnn[i](x)
            else:
                out2 = self.rnn[i](x)
                # Concatenate along the feature dimension so the combined
                # output matches a single GRU of the full hidden size.
                out = torch.cat((out, out2[0]), dim=-1)
                hidden = torch.cat((hidden, out2[1]), dim=-1)

        return out, hidden
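For reference, a minimal sanity check of the splitting logic might look like this (the sizes below are arbitrary placeholders, not my actual config):

# Minimal shape check; all sizes here are arbitrary placeholders.
model = pRNN(input_size=32, hidden_size=128, subunit_count=3).cuda()
x = torch.randn(50, 128, 32, device="cuda")  # (seq_len, batch, input_size)
out, hidden = model(x)
print(out.shape)     # torch.Size([50, 128, 128]): subunit outputs concatenated
print(hidden.shape)  # torch.Size([1, 128, 128])

The three subunits end up with hidden sizes 43, 43, and 42, so the concatenated output matches a single GRU with hidden_size=128.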

Unfortunately, this is roughly my fifth failed attempt at a technique that actually runs in parallel: no matter what I try, the RNNs execute sequentially (I even tried dividing the already modest batch size of 128 by the number of RNNs). I am running on a single V100.

Any suggestions are appreciated.

You could check whether your GPU has any resources left for a parallel execution of the RNNs.
RNNs use matrix multiplications internally, which are executed via cuBLAS on the GPU. Matmul kernels generally try to saturate the device as much as possible, so other kernels might have to wait for them to finish. You could check the profile via e.g. Nsight Systems and also take a look at e.g. this post.
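For example, assuming your entry point is a script called train.py (a placeholder name), you could capture a trace with:

# Record a timeline of CUDA kernel launches for later inspection in the
# Nsight Systems GUI; "prnn_report" is just an arbitrary output name.
nsys profile -o prnn_report python train.py

and then check on the timeline whether the GRU kernels from the different subunits actually overlap or are serialized one after another.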