Fast computation with PyTorch

I am currently doing this NumPy computation, which is very slow. Something like:

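import numpy as np

cutoff = self.theta
# Flatten each sample into a single row vector.
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
# For a row x: 1 / (number of rows whose dot product with x, divided by <x, x>, exceeds 1 - cutoff).
weightfun = lambda x: 1.0 / np.sum(np.dot(X_flat, x) / np.dot(x, x) > 1 - cutoff)
# This is a very expensive operation.
N_list = np.array(list(map(weightfun, X_flat)))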

Assuming I can move everything to PyTorch tensors and use the GPU, how could I compute this faster?
Currently it computes the dot product between every pair of row vectors in X_flat, so it is an O(N^2) operation, and it takes many hours on my laptop.
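
For what it is worth, here is a minimal sketch of how a batched PyTorch version might look. It assumes weightfun counts the rows of X_flat whose dot product with x, divided by np.dot(x, x), exceeds 1 - cutoff. The function name sequence_weights and the batch_size parameter are hypothetical, not from any library. The work is still O(N^2), but each batch becomes one matrix product that can run on the GPU:

```python
import numpy as np
import torch

def sequence_weights(X_flat, cutoff, batch_size=1024, device=None):
    """Hypothetical helper: for each row x_i, return
    1 / #{j : <x_j, x_i> / <x_i, x_i> > 1 - cutoff}."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    # float32 is an assumption made for GPU speed; results for values lying
    # exactly on the threshold may differ from a float64 NumPy run.
    X = torch.as_tensor(X_flat, dtype=torch.float32, device=device)
    norms = (X * X).sum(dim=1)                      # <x_i, x_i> for every row
    counts = torch.empty(X.shape[0], device=device)
    # Process rows in batches so the (batch, N) similarity block fits in memory.
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        sims = X[start:stop] @ X.T                  # sims[i, j] = <x_i, x_j>
        sims = sims / norms[start:stop, None]       # normalize by <x_i, x_i>
        counts[start:stop] = (sims > 1 - cutoff).sum(dim=1).float()
    return (1.0 / counts).cpu().numpy()

# Small usage example on random one-hot data (made up for illustration).
rng = np.random.default_rng(0)
X = np.eye(4)[rng.integers(0, 4, size=(50, 7))]    # shape (50, 7, 4)
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
N_list = sequence_weights(X_flat, cutoff=0.3, batch_size=16)
```

Each batch needs roughly batch_size * N floats for the similarity block, so batch_size would have to be tuned to the available GPU memory.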