Running independent `nn.Module` instances in `nn.ModuleList` truly in parallel in PyTorch

I have a PyTorch model that consists of multiple independent `FullyConnectedNetwork` instances stored inside an `nn.ModuleList`. Here’s the code:

import torch
import torch.nn as nn

class FullyConnectedNetwork(nn.Module):
    def __init__(self):
        super(FullyConnectedNetwork, self).__init__()
        self.fc1 = nn.Linear(20, 10)
        self.fc2 = nn.Linear(10, 1)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return x

class ParallelFCN(nn.Module):
    def __init__(self, n):
        super(ParallelFCN, self).__init__()
        self.models = nn.ModuleList([FullyConnectedNetwork() for _ in range(n)])
    
    def forward(self, x):
        outputs = [model(x[:, i*20:(i+1)*20]) for i, model in enumerate(self.models)]
        return torch.cat(outputs, dim=1)

# Example usage:
n = 1000
model = ParallelFCN(n)
print(model)

Currently, I’m using a for-loop to pass data through each FullyConnectedNetwork instance. However, I realize that this approach is not truly parallel in a software sense.

Given that each FullyConnectedNetwork is independent of the others, is there a way to run them truly in parallel, perhaps using multi-threading, multi-processing, or any other method in PyTorch?

I need this because the number of my modules can get really big (as big as 400), and processing them with a for loop is very slow.

Disclaimer: I am NOT fully familiar with the internals of PyTorch or CUDA, so I may be wrong in my implementations below, or the interpretations of my test results. I am however interested in the question, so let’s see if I can at least trigger Cunningham’s Law.

I’ll assume you would run your code on GPUs, since you care about performance, and I’ll assume those GPUs are CUDA-enabled, as that’s what’s most common (and what I’m most familiar with).

When looking at this issue, I had different ideas:

  1. In PyTorch, GPU operations are asynchronous by default. That is already a form of parallelism. Is it good enough? (TLDR: no)
  2. As far as I can tell, CUDA streams are meant to address this exact problem… (TLDR: in this case, it does not seem to help)
  3. torch.jit and torch.compile can be used to compile a module. Perhaps those compilation passes could detect and optimize these parallel flows? (TLDR: the jit.script model does run a bit faster, but not by orders of magnitude)
  4. Ultimately, each nn.Linear launches its own small kernel on the GPU; with n=1000, that’s 2000 kernel launches per forward pass. I think these could be refactored into just 2 grouped convolutions in total, i.e. 2 kernel launches. (TLDR: this is MUCH faster)

Let me go more into details for each of these. For starters, here is my benchmarking input and function:

import time
n = 1000
input_ = torch.rand(1, n * 20).cuda()
def benchmark(model, input_, warmup=5, steps=30):
    for _ in range(warmup):
        model(input_)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        model(input_)
    torch.cuda.synchronize()
    end = time.perf_counter()
    print(f"{model.__class__.__name__}: {end - start:.3f} s")

1. Asynchronous GPU operations (enabled by default)

GPU operations are asynchronous, allowing your CPU execution to get ahead of the GPU’s operations… although with a single GPU and the default stream, the kernels themselves still run one after another. Still, perhaps PyTorch natively optimizes for these kinds of situations, and I could measure that?

class ParallelFCN(nn.Module):
    def __init__(self, n):
        super(ParallelFCN, self).__init__()
        self.models = nn.ModuleList([FullyConnectedNetwork() for _ in range(n)])

    def forward(self, x):
        outputs = []
        for i, model in enumerate(self.models):
            outputs.append(model(x[:, i * 20:(i + 1) * 20]))
        return torch.cat(outputs, dim=1)


class StepWisePFCN(ParallelFCN):
    def forward(self, x):
        outputs = []
        for i, model in enumerate(self.models):
            outputs.append(model(x[:, i * 20:(i + 1) * 20]))
            torch.cuda.synchronize()
        return torch.cat(outputs, dim=1)

model = ParallelFCN(n).cuda()
benchmark(model, input_)

model = StepWisePFCN(n).cuda()
benchmark(model, input_)

That produces the following output. Clearly, the asynchronous execution already offers some speedup by default, but the difference is nowhere near the orders of magnitude we would expect if the 1000 sub-modules truly ran in parallel:

ParallelFCN: 0.730 s
StepWisePFCN: 1.057 s

2. CUDA streams

I’ve actually never used these, but I adapted the code here:

class MultiStreamPFCN(ParallelFCN):
    def forward(self, x):
        outputs = []
        for i, model in enumerate(self.models):
            with torch.cuda.stream(torch.cuda.Stream()):
                outputs.append(model(x[:, i * 20:(i + 1) * 20]))
        torch.cuda.synchronize()
        return torch.cat(outputs, dim=1)

model = MultiStreamPFCN(n).cuda()
benchmark(model, input_)

This actually runs slower than the original version (about 1.08 s). Perhaps the overhead from creating and switching between different streams is not worth it in this case…

MultiStreamPFCN: 1.079 s
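
A variation I did not benchmark: instead of constructing a new torch.cuda.Stream on every iteration, reuse a small pool of pre-created streams and synchronize explicitly, so each side stream only reads x once it is ready. The pool size of 8 is an arbitrary choice, and this is only a sketch, not a measured result (real code may also need tensor.record_stream to keep the caching allocator happy):

class StreamPoolPFCN(ParallelFCN):
    def __init__(self, n, num_streams=8):
        super().__init__(n)
        # Pre-create a small pool of streams instead of one new stream per iteration
        self.streams = [torch.cuda.Stream() for _ in range(num_streams)]

    def forward(self, x):
        outputs = []
        current = torch.cuda.current_stream()
        for i, model in enumerate(self.models):
            stream = self.streams[i % len(self.streams)]
            stream.wait_stream(current)  # make sure x is ready before the side stream reads it
            with torch.cuda.stream(stream):
                outputs.append(model(x[:, i * 20:(i + 1) * 20]))
        for stream in self.streams:
            current.wait_stream(stream)  # default stream waits for all side streams before cat
        return torch.cat(outputs, dim=1)

model = StreamPoolPFCN(n).cuda()
benchmark(model, input_)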

3. TorchScript / torch.compile optimizations

model = ParallelFCN(n).cuda()
model = torch.jit.script(model)
benchmark(model, input_)

model = ParallelFCN(n).cuda()
model = torch.compile(model)
benchmark(model, input_)

The jit.script version is slightly faster, but still far from what we could hope for:

RecursiveScriptModule: 0.644 s
OptimizedModule: 0.764 s
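
One more variation I did not benchmark: torch.compile also accepts mode="reduce-overhead", which tries to capture the repeated work with CUDA graphs and is aimed at exactly this kind of launch-overhead-bound workload. A sketch, not a measured result:

model = ParallelFCN(n).cuda()
# mode="reduce-overhead" attempts CUDA graph capture to cut per-kernel launch overhead
model = torch.compile(model, mode="reduce-overhead")
benchmark(model, input_)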

4. Refactoring into nn.Conv1d equivalent

If I am not wrong, using nn.Conv1d layers with kernel_size=1 and groups=n should give you an equivalent to your n parallel linear layers:

class ConvolutionalPFCN(nn.Module):
    def __init__(self, n):
        super(ConvolutionalPFCN, self).__init__()
        self.layer1 = nn.Conv1d(n*20, n*10, 1, groups=n)
        self.layer2 = nn.Conv1d(n*10, n*1, 1, groups=n)

    def forward(self, x):
        x = torch.unsqueeze(x, -1)
        x = self.layer1(x)
        x = self.layer2(x)
        x = torch.squeeze(x, -1)
        return x

model = ConvolutionalPFCN(n).cuda()
benchmark(model, input_)

This gives a MUCH faster execution time (at least 100x), although this solution is specific to converting linear layers rather than a generic way to parallelize arbitrary modules:

ConvolutionalPFCN: 0.006 s
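
To sanity-check the equivalence claim, here is a small sketch; the copy_weights helper is mine and not part of the classes above. It copies each FullyConnectedNetwork’s weights into the corresponding group of the convolutions and compares the outputs:

@torch.no_grad()
def copy_weights(pfcn, conv):
    for g, m in enumerate(pfcn.models):
        # Conv1d weight shape is (out_channels, in_channels // groups, kernel_size),
        # so group g of layer1 is a (10, 20, 1) slice matching fc1's (10, 20) weight
        conv.layer1.weight[g * 10:(g + 1) * 10, :, 0] = m.fc1.weight
        conv.layer1.bias[g * 10:(g + 1) * 10] = m.fc1.bias
        conv.layer2.weight[g:g + 1, :, 0] = m.fc2.weight
        conv.layer2.bias[g:g + 1] = m.fc2.bias

pfcn = ParallelFCN(n).cuda()
conv = ConvolutionalPFCN(n).cuda()
copy_weights(pfcn, conv)
print(torch.allclose(pfcn(input_), conv(input_), atol=1e-5))  # should print True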

CUDA operations are asynchronous w.r.t. the CPU by default. If you launch multiple kernels, the CPU can run ahead (assuming it’s fast enough and the GPU workload is large enough). This does not mean the GPU work is done; your profiling just shows when the host arrives at the line of code that stops the timer. If you synchronize at that point or access the result of the computation, PyTorch will implicitly synchronize for you.

Your first approach shows the overhead of multiple synchronizations inside the loop, which should also be visible in a profiler. Use the native profiler or Nsight Systems to see the actual execution and the blocking operations, including the idle gaps between the kernels.
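
For example, a minimal sketch with the native profiler (torch.profiler), reusing the benchmarking setup from above:

from torch.profiler import profile, ProfilerActivity

model = ParallelFCN(n).cuda()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(input_)
    torch.cuda.synchronize()
# Per-op CPU/CUDA times make the launch overhead and synchronization gaps easy to spot
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))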

Using custom CUDA streams allows you to execute kernels in parallel on the GPU. For that to actually happen, enough free compute resources are needed, and kernels are generally written in a way that saturates all compute resources on their own.
Take a look at this post and GTC presentation for more details.

Thanks for your informed reply. I’ll definitely go through that presentation video when I get the time.
Regarding approach 4 (convolutions), would you say it is still mostly valid?