Running multiple Modules in parallel

I am implementing a multi-head network. (This is to implement multi-head DQN, a specific reinforcement learning method, but this doesn’t really matter here.)

My network has the following architecture:

input -> 128x (separate fully connected layers) -> output averaging

I am using a ModuleList to hold the list of fully connected layers. Here’s how it looks at this point:

import torch.nn as nn
import torch.optim as optim

class MultiHead(nn.Module):
    def __init__(self, dim_state, dim_action, hidden_size=32, nb_heads=1):
        super(MultiHead, self).__init__()
        self.networks = nn.ModuleList()
        for _ in range(nb_heads):
            # One independent fully connected head
            network = nn.Sequential(
                nn.Linear(dim_state, hidden_size),
                nn.Linear(hidden_size, dim_action)
            )
            self.networks.append(network)
        self.optimizer = optim.Adam(self.parameters())

Then, when I need to calculate the output, I use a for ... in construct to perform the forward and backward pass through all the layers:

q_values = [net(observations) for net in self.networks]

# skipped code which ultimately computes the loss I need
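
For the "output averaging" step of the architecture, a minimal sketch of one way to combine the per-head outputs is to stack the list into a single tensor and take the mean over the head dimension. The sizes and the dummy per-head tensors below are made up for illustration; in the real code `q_values` would be the list produced by the comprehension above.

```python
import torch

# Hypothetical sizes, for illustration only.
nb_heads, batch, dim_action = 4, 2, 3

# Stand-in for [net(observations) for net in self.networks]:
# head i outputs a tensor filled with the value i.
q_values = [torch.full((batch, dim_action), float(i)) for i in range(nb_heads)]

# Stack to (nb_heads, batch, dim_action), then average over the heads.
stacked = torch.stack(q_values, dim=0)
mean_q = stacked.mean(dim=0)  # shape: (batch, dim_action)
```

This only changes how the outputs are combined. The heads themselves are still evaluated one after another.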


This works! But I am wondering whether I could do this more efficiently. By using a for loop, I am going through each separate FC layer one by one, and as a result the training time grows with the number of FC layers.

Can this operation be done in parallel?

This is similar to this (unanswered) question: Parallel execution of modules in nn.ModuleList


I have tried with CUDA streams but I still see a big slowdown when scaling up the number of heads:

    streams = [torch.cuda.Stream() for _ in range(nb_heads)]
    losses = []
    net_idx = 0
    for net in self.networks:
        # Queue each head's work on its own CUDA stream
        with torch.cuda.stream(streams[net_idx]):
            q_values = net(observations)
            q_values = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

            next_q_values = net(next_observations)
            next_q_values = next_q_values.max(1)[0]

            expected_q_values = rewards + gamma * next_q_values * (1 - terminals)
            losses.append((q_values - expected_q_values.detach()))
        net_idx += 1
    loss =

I have also tried to use a pool of CUDA streams to avoid having to recreate them at each timestep, and I have removed the torch.cuda.synchronize() calls, which don’t seem to be useful.

But still, the training speed decreases linearly with the number of heads :(

Another option may be to use (abuse?) a Conv1d layer, using the groups argument to handle the parallel networks simultaneously, and using a kernel size of 1 to get it to act as a Linear layer.

Would that be equivalent? Should I expect a speedup from this approach?
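
On the equivalence question, here is a minimal sketch that can be used to check it numerically. It builds the separate heads as in the MultiHead class above, copies their weights into two grouped Conv1d layers with kernel size 1, and compares the outputs. The sizes, the variable names (`heads`, `grouped`) and the weight-copying step are my own assumptions for illustration. It says nothing about speed, which would have to be measured separately.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, for illustration only.
batch, dim_state, hidden_size, dim_action, nb_heads = 5, 4, 32, 3, 8

# Reference: separate heads, evaluated one by one.
heads = nn.ModuleList([
    nn.Sequential(nn.Linear(dim_state, hidden_size),
                  nn.Linear(hidden_size, dim_action))
    for _ in range(nb_heads)
])

# Grouped version: each group of channels plays the role of one head.
grouped = nn.Sequential(
    nn.Conv1d(nb_heads * dim_state, nb_heads * hidden_size,
              kernel_size=1, groups=nb_heads),
    nn.Conv1d(nb_heads * hidden_size, nb_heads * dim_action,
              kernel_size=1, groups=nb_heads),
)

# Copy the Linear weights into the Conv1d layers so both compute the same thing.
# Conv1d weight shape is (out_channels, in_channels // groups, kernel_size).
with torch.no_grad():
    for layer_idx in (0, 1):
        w = torch.cat([head[layer_idx].weight for head in heads], dim=0)
        b = torch.cat([head[layer_idx].bias for head in heads], dim=0)
        grouped[layer_idx].weight.copy_(w.unsqueeze(-1))
        grouped[layer_idx].bias.copy_(b)

observations = torch.randn(batch, dim_state)

# Every head sees the same input, so tile it once per group: (B, heads*dim_state, 1).
x = observations.repeat(1, nb_heads).unsqueeze(-1)
out = grouped(x).reshape(batch, nb_heads, dim_action)

# Loop-based reference with the same layout: (B, nb_heads, dim_action).
ref = torch.stack([head(observations) for head in heads], dim=1)

same = torch.allclose(out, ref, atol=1e-5)
```

If `same` is true, the grouped Conv1d computes the same values as the loop over heads for these weights, up to floating point tolerance. Note that if a nonlinearity is added between the two Linear layers, the same elementwise nonlinearity can be placed between the two Conv1d layers.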

I posted this in the other discussion thread, but it seems that the other option that @MasterScrat mentioned is in fact faster, but only if you have a GPU.