# Fisher-Markov Algorithm: GPU version runs slower than CPU version

I am trying to implement Fisher-Markov Algorithm using PyTorch (Eq. 19).

This is my CPU implementation:

``````
import numpy as np

def fisher(X, y, n_class=10, n_features=30810, gamma=-0.5, beta=0):

    theta = []
    n = X.shape[0]  # number of samples

    # group sample indices by class
    C = [[] for _ in range(n_class)]
    for j, c in enumerate(y):
        C[int(c)].append(int(j))

    for j in range(n_features):
        # part 1: sum over classes of (1/n_g) * (sum of X[i,j] over class g)^2
        k = 0
        for g in range(n_class):
            # get indices of class-g samples
            i = np.asarray(C[g])
            k += (1 / len(i)) * np.sum(np.dot(np.asarray([X[i, j]]).transpose(), [X[i, j]]))
        k *= (1 / n)

        # part 2: sum of squares of column j
        l = np.dot(X[:, j], X[:, j])
        l *= (gamma / n)

        # part 3: square of the sum of column j
        m = np.sum(np.dot(np.asarray([X[:, j]]).transpose(), [X[:, j]]))
        m *= (gamma - 1) / (n * n)

        theta.append((k - l + m, j))

    # sort by score, highest first; each row is a (score, feature-index) pair
    theta = np.asarray(sorted(theta, key=lambda x: x[0], reverse=True))

    return theta[:beta]
``````

and this is my GPU version:

``````
import numpy as np
import torch
from torch.autograd import Variable

def fisher(X, y, n_class=10, n_features=30810, gamma=-0.5, beta=0):

    # plain tensor (not a Variable) so we can assign scores element-wise
    theta = torch.zeros(n_features).cuda()

    # group sample indices by class
    C = [[] for _ in range(n_class)]
    for j, c in enumerate(y.data):
        C[int(c)].append(int(j))
    n = X.size(0)  # number of samples
    C = [torch.from_numpy(np.asarray(ii, dtype=int)).cuda() for ii in C]
    L = torch.sum(X * X, 0)  # per-feature sum of squares, computed once

    for j in range(n_features):
        # part 1
        k = 0
        for g in range(n_class):
            i = C[g]
            # row vector of class-g values for feature j
            u = X[i][:, j].contiguous().view(1, -1)
            k += (1 / i.size(0)) * torch.sum(torch.mm(torch.t(u), u))
        k *= (1 / n)

        # part 2
        l = L[j] * (gamma / n)

        # part 3
        v = X[:, j].contiguous().view(1, -1)
        m = torch.sum(torch.mm(torch.t(v), v))
        m = m * ((gamma - 1) / (n * n))

        theta[j] = (k - l + m).data[0]

    # torch.sort returns (sorted_values, sorted_indices)
    values, indices = torch.sort(theta, 0, descending=True)

    return indices[:beta]
``````

And the result shows the GPU version is much slower than the CPU version. I hope somebody here can give some advice or show me what improvements I can make to this source code so that my GPU version runs faster.

I use PyTorch version 0.2.0. `n_features` is around 20,000, `n` is around 1,000, and `n_class` is 10. My implementation may differ slightly from the original equation.

Thank You

The GPU code could be faster if you vectorize (or batch) some operations. Python for loops are generally slow on the GPU because each iteration launches a few CUDA kernels, and every kernel launch has fixed overhead. Because your `n_class` and `n_features` are large, this means many CUDA kernels get launched, and the launch overhead dominates.

``````
k = 0
for g in range(n_class):
    i = C[g]
    u = X[i][:,j].contiguous().view(-1,X[i][:,j].size())
    k += (1 / i.size()) * torch.sum(torch.mm(torch.t(u),u))
k *= (1 / n)
``````

I’m not entirely sure what you’re trying to do here, but if you can replace the for loop purely with operations on the `Variable`, it will run faster on the GPU.
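For instance, all three parts of the score can be computed for every feature at once with matrix operations, removing both the per-feature and the per-class loops. Here is a sketch in NumPy so the math is easy to check; the function name `fisher_vectorized` and the one-hot indicator trick are my own, and every operation used (`mm`-style matmul, `sum`, `sort`) has a direct torch equivalent, so porting it to CUDA tensors is mechanical:

``````python
import numpy as np

def fisher_vectorized(X, y, n_class, gamma=-0.5, beta=0):
    # X: (n, n_features) data matrix, y: (n,) integer class labels.
    # Assumes every class in range(n_class) has at least one sample.
    n = X.shape[0]

    # One-hot class indicator, shape (n_class, n): G[c, i] = 1 iff sample i is in class c.
    G = np.zeros((n_class, n))
    G[y.astype(int), np.arange(n)] = 1.0
    counts = G.sum(axis=1, keepdims=True)       # samples per class, (n_class, 1)

    # part 1: per-class column sums in one matmul, then (1/n_g) * sum^2, summed over classes
    class_sums = G @ X                          # (n_class, n_features)
    part1 = (1 / n) * ((class_sums ** 2) / counts).sum(axis=0)

    # part 2: per-feature sum of squares
    part2 = (gamma / n) * (X ** 2).sum(axis=0)

    # part 3: per-feature square of the sum
    part3 = ((gamma - 1) / (n * n)) * X.sum(axis=0) ** 2

    theta = part1 - part2 + part3
    return np.argsort(-theta)[:beta]            # top-beta feature indices, best first
``````

In torch, `G @ X` becomes a single `torch.mm` on CUDA tensors, so the whole score is a handful of kernel launches instead of roughly `n_features * n_class` of them.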