Given a 2D matrix of size 2000x1000, I need to compute the outer product of each row with itself and then average all of the outer products. What is the fastest/most efficient way of doing so?

This is by far the biggest bottleneck in my program. I've come up with two solutions, and the second, against all intuition, is much faster than the first. Any help improving the speed of this computation would be greatly appreciated. The computation will run on the GPU, in case that is relevant.

Slowest:

```
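import torch

def avg_matrix_outer_products_v1(a):
    x_dim, y_dim = a.shape
    # Batched (y_dim, 1) @ (1, y_dim) products: shape (x_dim, y_dim, y_dim).
    outer_products = torch.matmul(a.view(x_dim, y_dim, 1), a.view(x_dim, 1, y_dim))
    # Average over the row (batch) dimension; avoids the deprecated .T on a 3D tensor.
    return torch.mean(outer_products, 0)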
```

This method can run into memory issues, but that is easily fixed by splitting "a" into submatrices of only 250-500 rows and calling the function multiple times.
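
The chunked variant described above could look roughly like this. It is only a minimal sketch: the function name and the `chunk_rows` parameter are illustrative assumptions, and it sums the per-chunk outer products before dividing by the total row count so that an uneven last chunk is still weighted correctly.

```python
import torch

def avg_outer_products_chunked(a, chunk_rows=500):
    # chunk_rows is a hypothetical parameter; 250-500 is the range mentioned above.
    n_rows, n_cols = a.shape
    total = torch.zeros(n_cols, n_cols, dtype=a.dtype, device=a.device)
    for chunk in torch.split(a, chunk_rows, dim=0):
        # Batched outer products for this chunk: shape (rows, n_cols, n_cols).
        outer = chunk.unsqueeze(2) * chunk.unsqueeze(1)
        # Sum (not mean) so chunks of different sizes are weighted by row count.
        total += outer.sum(dim=0)
    return total / n_rows
```

Peak memory then scales with `chunk_rows * n_cols * n_cols` elements rather than with the full row count.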

Fastest:

```
def avg_matrix_outer_products_v2(a):
    x_dim = a.shape[0]
    # Accumulate the outer products one row at a time.
    outer_products = torch.outer(a[0], a[0])
    for j in range(1, x_dim):
        outer_products += torch.outer(a[j], a[j])
    return outer_products / x_dim
```
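
Both versions compute the average of the outer products of the rows of `a`, and the sum of those outer products is algebraically equal to `a.T @ a`. A minimal sketch of using that identity as a correctness reference follows; the function name is an illustrative assumption, and the snippet makes no claim about timings, which would need to be measured on the actual GPU.

```python
import torch

def avg_outer_products_reference(a):
    # sum_i outer(a[i], a[i]) == a.T @ a, so the average is that
    # product divided by the number of rows.
    return (a.T @ a) / a.shape[0]

# Small random input; float64 keeps rounding differences tiny.
a = torch.randn(20, 5, dtype=torch.float64)
looped = sum(torch.outer(row, row) for row in a) / a.shape[0]
print(torch.allclose(avg_outer_products_reference(a), looped))
```

Since this is a single matrix multiplication with a `(y_dim, y_dim)` result, it never materialises the `(x_dim, y_dim, y_dim)` intermediate tensor.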