What would be faster? I thought I might be able to pair matrices and parallelise the computation with batched matrix multiplications, but it was slower: https://gist.github.com/gngdb/70fce4f27cdaeeb3f8f18cf9929e60d3

Hi gngdb, the recursive function you are using has overhead from concatenation and slicing, so for reducing shorter lists the naive reduce may be faster. However, for list lengths of 128, I got the following result.
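For reference, the two reductions being compared can be sketched as follows. This is a reconstruction of the idea rather than the gist's actual code, reusing its function names; the pairing scheme and odd-element handling are assumptions.

```python
import torch
from functools import reduce

def functools_reduce(mats):
    # naive baseline: multiply the matrices one at a time, in order
    return reduce(torch.matmul, mats)

def recursive_reduce(mats):
    # pair the matrices up and multiply every pair in one batched
    # matmul, halving the list length at each recursion level
    if len(mats) == 1:
        return mats[0]
    odd = [mats[-1]] if len(mats) % 2 else []  # carry an odd leftover matrix
    even = mats[:len(mats) - len(odd)]
    pairs = torch.bmm(torch.stack(even[0::2]), torch.stack(even[1::2]))
    return recursive_reduce(list(pairs) + odd)
```

The batched version does O(log n) sequential steps instead of O(n), which is why it wins on the GPU once the list is long enough, while the stack/slice bookkeeping dominates for short lists on the CPU.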

```
functools:
CPU: 2.2555802155067415
GPU: 3.1399718035379047
recursive:
CPU: 10.128527292804142
GPU: 1.773943289081565
```

So your batching method scales better than the naive reduce.

Ah, that makes sense, thanks.

Looking at this again today, there’s actually a massive mistake. The test checking that `functools_reduce` and `recursive_reduce` match was wrong: it was just testing that `recursive_reduce` and `recursive_reduce` were equal. Fixing that, I realised that the resulting matrix explodes, which makes the test a little difficult, so I have to scale each matrix by the square root of M to keep the product at approximately unit variance.