Speed of reduce(torch.matmul, matrix_list)

What would be faster? I thought I might be able to pair matrices and parallelise the computation with batched matrix multiplications, but it was slower: https://gist.github.com/gngdb/70fce4f27cdaeeb3f8f18cf9929e60d3
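For reference, the pairing idea can be sketched like this. This is my own sketch of the approach, not the gist's actual code: `naive_reduce` and `pairwise_reduce` are hypothetical names, and square matrices are assumed so that `torch.bmm` can batch the pairs.

```python
import torch
from functools import reduce

def naive_reduce(mats):
    # Sequential product: (((m0 @ m1) @ m2) @ m3) ...
    return reduce(torch.matmul, mats)

def pairwise_reduce(mats):
    # Multiply adjacent pairs with one batched matmul per round,
    # halving the list each time: O(log n) rounds instead of n - 1
    # sequential matmuls.
    while len(mats) > 1:
        # If the count is odd, carry the last matrix to the next round.
        carry = [mats.pop()] if len(mats) % 2 else []
        pairs = torch.bmm(torch.stack(mats[0::2]), torch.stack(mats[1::2]))
        mats = list(pairs) + carry
    return mats[0]

M = 8
mats = [torch.randn(M, M) / M ** 0.5 for _ in range(16)]
assert torch.allclose(naive_reduce(mats), pairwise_reduce(list(mats)), atol=1e-5)
```

Each round launches a single `bmm` kernel over all pairs, which is what should make this friendlier to a GPU than a chain of small `matmul` calls.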

Hi gngdb, the recursive function you are using has overhead from concatenation and slicing, so for reducing smaller lists the naive reduce may be faster. However, for a list length of 128, I got the following result:

  CPU:  2.2555802155067415
  GPU:  3.1399718035379047
  CPU:  10.128527292804142
  GPU:  1.773943289081565

So your batching method scales better than the naive reduce.
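A minimal harness for producing numbers like the ones above might look as follows. This is a hedged sketch, not the benchmark actually used for those timings, and the function name `naive_reduce` is my own.

```python
import timeit
import torch
from functools import reduce

def naive_reduce(mats):
    # Chain of n - 1 sequential matmuls.
    return reduce(torch.matmul, mats)

M, n = 64, 128
# Scale by 1/sqrt(M) so the product stays numerically well-behaved.
mats = [torch.randn(M, M) / M ** 0.5 for _ in range(n)]

elapsed = timeit.timeit(lambda: naive_reduce(mats), number=10)
print(f"naive reduce, CPU, 10 runs: {elapsed:.4f}s")
```

On a GPU you would move the matrices with `.cuda()` and call `torch.cuda.synchronize()` before reading the clock, otherwise the asynchronous kernel launches make the timing meaningless.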

Ah, that makes sense, thanks.

Looking at this again today, there's actually a massive mistake. The test meant to check that functools_reduce and recursive_reduce match was wrong: it was just checking that recursive_reduce equals recursive_reduce. After fixing that, I realised that the values of the resulting matrix explode, which makes the comparison difficult, so each matrix has to be scaled by the square root of M to keep the product at approximately unit variance.
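A corrected check might look something like this. It's a sketch rather than the gist's actual code: recursive_reduce is stood in for by the same product evaluated right-to-left, which is enough to make the two sides genuinely different computations.

```python
import torch
from functools import reduce

torch.manual_seed(0)
M, n = 64, 128
# torch.randn entries have unit variance, and each matmul multiplies
# the element variance by M, so a product of 128 raw matrices
# overflows float32. Dividing every matrix by sqrt(M) keeps the
# product's entries at roughly unit variance instead.
mats = [torch.randn(M, M) / M ** 0.5 for _ in range(n)]

left = reduce(torch.matmul, mats)  # left-to-right, as functools_reduce
# Stand-in for recursive_reduce: same product, right-to-left.
right = reduce(lambda acc, m: torch.matmul(m, acc), reversed(mats))

rel_err = (left - right).norm() / left.norm()
assert rel_err < 1e-3  # equal up to float32 rounding
```

Comparing a relative error rather than calling `torch.allclose` with default tolerances is deliberate: 128 chained float32 matmuls accumulate enough rounding that an exact-ish elementwise match would spuriously fail.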