I have a NumPy computation that is very slow. It looks something like this:

```
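import numpy as np

cutoff = self.theta
# Flatten each sample into one row: (N, A, B) -> (N, A * B)
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
# For a row x: count the rows whose dot product with x, divided by x.x, exceeds 1 - cutoff
weightfun = lambda x: 1.0 / np.sum(np.dot(X_flat, x) / np.dot(x, x) > 1 - cutoff)
# This is a very expensive operation: every row is compared against all N rows.
N_list = np.array(list(map(weightfun, X_flat)))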
```

Assuming I can move everything to PyTorch tensors and use the GPU, how could I compute this faster?

Currently it computes the dot product between every pair of rows of `X_flat`, which is an O(N^2) operation, and it takes many hours on my laptop.
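
One possible direction, as a rough sketch rather than a definitive answer: all the pairwise dot products together form the matrix product `X_flat @ X_flat.T`, so the Python-level `map` can be replaced by a few large matrix multiplications. The function name `pairwise_weights` and the `chunk_size` and `device` parameters below are made up for illustration. The sketch assumes `X` fits on the device and uses float32, so values very close to the `1 - cutoff` threshold may be classified differently from the float64 NumPy version.

```python
import torch


def pairwise_weights(X, cutoff, chunk_size=1024, device=None):
    """For each row x_j, return 1 / #{i : <x_i, x_j> / <x_j, x_j> > 1 - cutoff}."""
    if device is None:
        # Use the GPU when one is visible, otherwise fall back to the CPU.
        device = "cuda" if torch.cuda.is_available() else "cpu"
    # Flatten every sample to one row, like X_flat in the question.
    X_flat = torch.as_tensor(X, dtype=torch.float32, device=device).reshape(len(X), -1)
    n = X_flat.shape[0]
    sq_norms = (X_flat * X_flat).sum(dim=1)  # <x_j, x_j> for every row j
    counts = torch.empty(n, dtype=torch.float32, device=device)
    # Work in row blocks so only a (chunk_size, N) slice of the N x N matrix exists at once.
    for start in range(0, n, chunk_size):
        stop = start + chunk_size
        dots = X_flat[start:stop] @ X_flat.T       # row j of the block equals X_flat @ x_j
        ratio = dots / sq_norms[start:stop, None]  # divide row j by <x_j, x_j>
        counts[start:stop] = (ratio > 1 - cutoff).sum(dim=1)
    return (1.0 / counts).cpu()
```

It would be called as `N_list = pairwise_weights(X, cutoff)`, which returns a CPU tensor (`.numpy()` converts it back). The work is still O(N^2), but it runs as batched matrix products instead of N separate NumPy calls, and `chunk_size` trades memory for the number of kernel launches.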