Most efficient way to execute Linear layers in parallel

Hello there,

I have a lot of linear layers (up to 12000) with variable input sizes (1-100) and an output size of 1. I currently see three ways to tackle this:

  1. Loop through each layer (low memory consumption, slow)
  2. Merge them into one big layer using torch.block_diag on all weight matrices (high memory consumption, fast inference, but slow when calculating gradients)
  3. Find a middle ground and combine some of the layers

The problem with 2 is that I can't directly create a sparse matrix, so I first have to materialize this big > 100000x100000 matrix that is mostly zeros.
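For reference, here is a minimal sketch of what I mean by option 2 (the sizes and names like `in_sizes` are made up for illustration; my real model has up to 12000 layers):

```python
import torch

# hypothetical example sizes; the real model has up to 12000 layers
in_sizes = [3, 7, 100, 42]
layers = [torch.nn.Linear(n, 1) for n in in_sizes]

# option 2: merge all weights into one big block-diagonal matrix;
# for 12000 layers this materializes a huge, mostly-zero dense matrix
big_weight = torch.block_diag(*[l.weight for l in layers])  # (num_layers, sum(in_sizes))
big_bias = torch.cat([l.bias for l in layers])              # (num_layers,)

x = torch.cat([torch.randn(n) for n in in_sizes])           # all inputs concatenated
out = big_weight @ x + big_bias                             # one scalar output per layer
```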

Does anyone here know of a more efficient/better solution for this problem?
Thanks :slight_smile:

Loop through each layer (low memory consumption, slow)

I guess you are trying to launch a lot of small kernels in a loop, which would then suffer from the dispatching and kernel launch overheads. The CPU would thus not be able to run ahead and schedule all kernels fast enough, which would show up as short bursts of high GPU utilization followed by drops.
If so, consider using CUDA Graphs, as they could capture all your calls and replay them in a single go, freeing CPU resources and thus avoiding the CPU limit.
Have a look at this blog post for more information and these docs on how to use it.
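To illustrate, a minimal sketch of graph capture and replay for this use case, assuming fixed shapes and static input buffers (the sizes and variable names here are placeholders, not from your code):

```python
import torch

device = "cuda"
in_sizes = [3, 7, 100, 42]  # hypothetical; yours would have up to 12000 entries
layers = [torch.nn.Linear(n, 1, device=device) for n in in_sizes]
# inputs must live in static buffers that are reused for every replay
static_inputs = [torch.randn(1, n, device=device) for n in in_sizes]

# warmup on a side stream (required before capture)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for layer, x in zip(layers, static_inputs):
        layer(x)
torch.cuda.current_stream().wait_stream(s)

# capture the whole loop of small forward passes once
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_outputs = [layer(x) for layer, x in zip(layers, static_inputs)]

# replay: copy fresh data into the static buffers, then rerun everything in one go
for x, n in zip(static_inputs, in_sizes):
    x.copy_(torch.randn(1, n, device=device))
g.replay()  # static_outputs now hold the results for the new inputs
```

This captures the forward pass only; capturing the backward pass as well is possible, but follow the patterns described in the linked docs for that.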