Surprisingly low performance in pseudo-Linear layer

I am trying to understand why a PseudoLinear layer that skips the matmul and instead does an element-wise multiplication by a weight vector is not proportionally faster (e.g. on a 160-wide layer I would expect a ~10x speedup, but in practice it is negligible), and what can be done about it. Here is my experiment:

The above code prints
100%|██████████| 400/400 [00:38<00:00, 10.48it/s]
real: 38.16433668136597s
100%|██████████| 400/400 [00:35<00:00, 11.29it/s]
fake: 35.41974401473999s

In theory, the speedup should be on the order of 160x. Even if matmul is 16x accelerated by tensor cores (a generous assumption, IMHO), why am I not seeing 10x? The speedup becomes more noticeable as I widen the layers to 1024, but the asymptotics are still quite far off (there should have been a ~4x difference between the two; (1024×1024)/(256×256)):
100%|██████████| 40/40 [01:05<00:00, 1.63s/it]
real: 65.22167682647705s
100%|██████████| 40/40 [00:21<00:00, 1.86it/s]
fake: 21.476550817489624s

I am able to achieve larger speedups for larger kernel sizes.

For example:

| Width | Speedup |
| --- | --- |
| 128 | 0.9691576570785311 |
| 256 | 0.9655162001068844 |
| 512 | 1.0440890002830863 |
| 1024 | 1.1333492156383425 |
| 2048 | 1.1409784764893438 |
| 4096 | 2.3043928215624705 |
| 8192 | 7.894285981654935 |
| 16384 | 31.74388522870358 |
| 24576 | 69.13806393385168 |
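For reference, numbers like these can be produced with a sweep along the following lines (a sketch, not the exact script behind the table above; the batch size, iteration count, helper names, and the compact `PseudoLinear` definition are assumptions):

```python
import time
import torch
import torch.nn as nn

class PseudoLinear(nn.Module):
    """Element-wise stand-in for nn.Linear, as described in the original post."""
    def __init__(self, n):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n))
        self.bias = nn.Parameter(torch.zeros(n))

    def forward(self, x):
        return x * self.weight + self.bias

def bench(model, x, iters=100):
    # Warm up, then time forward+backward; synchronize so we measure the GPU
    # work itself, not just the kernel launches.
    for _ in range(3):
        model(x).sum().backward()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x).sum().backward()
    torch.cuda.synchronize()
    return time.time() - start

batch = 4096
for width in (128, 256, 512, 1024, 2048, 4096):
    x = torch.randn(batch, width, device="cuda")
    real = nn.Linear(width, width).cuda()
    fake = PseudoLinear(width).cuda()
    print(width, bench(real, x) / bench(fake, x))
```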

This won’t be the same across different accelerators, though. Here are results from another GPU:

| Width | Speedup |
| --- | --- |
| 128 | 0.6435953703453665 |
| 256 | 0.9463080602107735 |
| 512 | 0.4413361923750669 |
| 1024 | 1.5441994530390386 |
| 2048 | 3.5002745561894955 |
| 4096 | 15.720719325990684 |
| 6144 | 34.979133409494324 |

You can improve the performance of small kernels using CUDA Graphs.
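Something along these lines, for example (a sketch of `torch.cuda.make_graphed_callables`; the width and batch size are placeholders, and `PseudoLinear` refers to the element-wise layer sketched above):

```python
import torch
import torch.nn as nn

width, batch = 256, 4096
real = nn.Linear(width, width).cuda()
fake = PseudoLinear(width).cuda()   # element-wise layer from the sweep sketch above

# Sample inputs fix the shapes/dtypes the graphs will be captured with.
sample_real = torch.randn(batch, width, device="cuda")
sample_fake = torch.randn(batch, width, device="cuda")

# Capture each callable's forward (and backward) into a CUDA graph once;
# replaying a graph skips per-kernel launch overhead, which is what
# dominates at small widths.
real_g, fake_g = torch.cuda.make_graphed_callables(
    (real, fake), ((sample_real,), (sample_fake,))
)

x = torch.randn(batch, width, device="cuda")  # must keep the captured shape/dtype
real_g(x).sum().backward()
fake_g(x).sum().backward()
```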

Results from the same device, but using `torch.cuda.make_graphed_callables`:

| Width | Speedup |
| --- | --- |
| 128 | 0.8964585760691058 |
| 256 | 1.056850138382113 |
| 512 | 1.7432108905596932 |
| 1024 | 3.1071983489860946 |
| 2048 | 8.761643700438167 |
| 4096 | 32.27259183628145 |
| 6144 | 68.84583575358431 |

It would be interesting to know what this speedup converges to. Figured it out: it’s infinity.

I mean, theoretically, the speedup (or rather the difference in speed, because the substitute is nothing like Linear) should be exactly the input width N, less a small constant for the bias. A matmul does N×N multiplications while this thing does only N, for both the forward and backward passes, so the ratio N²/N = N grows without bound, which is what you see at larger widths.

It is curious that even at larger sizes the speed difference is all over the place, but I guess this may have something to do with the memory/cache hierarchy.

You’re right, I assumed this was a small-kernel issue and didn’t really think about it.

In your code:

  1. You include random data generation inside the function you are timing.
  2. You execute `x * self.weight + self.bias + x`; I assume the trailing `+ x` is a typo.
  3. `nn.ReLU()` adds a constant factor to both variants; it’s best to remove it for benchmarking (a corrected sketch follows this list).
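Concretely, the corrected layer and input setup might look roughly like this (a sketch; the width and batch size are placeholders, and the numbered comments refer to the points above):

```python
import torch
import torch.nn as nn

class PseudoLinear(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n))
        self.bias = nn.Parameter(torch.zeros(n))

    def forward(self, x):
        return x * self.weight + self.bias   # (2) no trailing "+ x"

width = 1024
model = PseudoLinear(width).cuda()            # (3) no nn.ReLU() in the benchmarked stack
x = torch.randn(32768, width, device="cuda")  # (1) generated once, outside the timed loop
```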

Notice that `nn.Linear` has a forward pass that is implemented entirely in C++/CUDA. In your `PseudoLinear` each operator also runs on the GPU, but the multiplication and the addition are launched as separate kernels, which adds overhead. You can use the JIT to mitigate this a little. You won’t see any benefit from the JIT for the `nn.Linear`-based model, though.
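For example (a sketch using TorchScript; whether and how much the fuser helps depends on the PyTorch version and the GPU):

```python
import torch
from torch import Tensor

@torch.jit.script
def pseudo_linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor:
    # Scripted so the JIT fuser can emit a single fused kernel for the
    # multiply and the add, instead of one kernel launch per op.
    return x * weight + bias

width = 1024
x = torch.randn(32768, width, device="cuda")
w = torch.randn(width, device="cuda", requires_grad=True)
b = torch.zeros(width, device="cuda", requires_grad=True)
y = pseudo_linear(x, w, b)   # fusion typically kicks in after the first call or two
```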

After fixing the above, I get the following performance scaling:

I used a batch size of 32768 to fully utilize the GPU even for a small number of features.

Still not what we would expect looking at the number of operations. Another reason might be the highly optimized matrix multiplication algorithm: for matmul there are algorithms (e.g. Strassen’s) that need fewer scalar multiplications than the naive algorithm, which brings Linear and PseudoLinear closer to each other in terms of the number of multiplications performed.
