Fisher-Markov Algorithm: GPU version runs slower than CPU version

I am trying to implement the Fisher-Markov feature-selection algorithm using PyTorch (Eq. 19 of the paper).
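For reference, the score my code computes for feature j is the following (reconstructed from my own implementation, so it may differ slightly from the paper's Eq. 19):

\theta_j = \frac{1}{n}\sum_{g=1}^{n_{\mathrm{class}}}\frac{1}{|C_g|}\Big(\sum_{i\in C_g} x_{ij}\Big)^{2} - \frac{\gamma}{n}\sum_{i=1}^{n} x_{ij}^{2} + \frac{\gamma-1}{n^{2}}\Big(\sum_{i=1}^{n} x_{ij}\Big)^{2}

where C_g is the set of sample indices belonging to class g.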

This is my CPU implementation:

import numpy as np

def fisher(X, y, n_class=10, n_features=30810, gamma=-0.5, beta=0):

    theta = []
    n = X.shape[0]

    # group sample indices by class label (one list per class, not per feature)
    C = [[] for _ in range(n_class)]
    for j, c in enumerate(y):
        C[int(c)].append(int(j))

    for j in range(n_features):
        # part 1: within-class sum of pairwise products of column j
        k = 0
        for g in range(n_class):
            # indices of the samples in class g
            i = np.asarray(C[g])
            k += (1 / len(i)) * np.sum(np.dot(np.asarray([X[i, j]]).transpose(), [X[i, j]]))
        k *= (1 / n)

        # part 2: sum of squares of column j
        l = np.dot(X[:, j], X[:, j])
        l *= (gamma / n)

        # part 3: sum of all pairwise products of column j
        m = np.sum(np.dot(np.asarray([X[:, j]]).transpose(), [X[:, j]]))
        m *= (gamma - 1) / (n * n)

        theta.append((k - l + m, j))

    # sort by score, descending; the int cast truncates the scores but keeps the indices intact
    theta = np.asarray(sorted(theta, key=lambda x: x[0], reverse=True), dtype=int)

    return theta[:beta]

And this is my GPU version:

import numpy as np
import torch
from torch.autograd import Variable

def fisher(X, y, n_class=10, n_features=30810, gamma=-0.5, beta=0):

    theta = Variable(torch.zeros(n_features)).cuda()

    # group sample indices by class label
    C = [[] for _ in range(n_class)]
    for j, c in enumerate(y.data):
        C[int(c)].append(int(j))
    C = [torch.from_numpy(np.asarray(ii, dtype=int)).cuda() for ii in C]

    n = X.size()[0]
    L = torch.sum(X * X, 0)   # per-column sums of squares, precomputed for part 2

    for j in range(n_features):
        # part 1: within-class sum of pairwise products of column j
        k = 0
        for g in range(n_class):
            i = C[g]
            u = X[i][:, j].contiguous().view(1, -1)   # class-g entries of column j as a row vector
            k += (1 / i.size()[0]) * torch.sum(torch.mm(torch.t(u), u))
        k *= (1 / n)

        # part 2: sum of squares of column j
        l = L[j] * (gamma / n)

        # part 3: sum of all pairwise products of column j
        v = X[:, j].contiguous().view(1, -1)
        m = torch.sum(torch.mm(torch.t(v), v))
        m = m * ((gamma - 1) / (n * n))

        res = k - l + m
        theta[j] = res.data

    # indices of features sorted by score, descending
    theta = torch.sort(theta, 0, descending=True)[1]

    return theta[:beta]

The result shows the GPU version is much slower than the CPU version. I hope somebody here can give me some advice, or show me what improvements I can make to this code so the GPU version runs faster.

I use PyTorch version 0.2.0. n_features is around 20,000, n is around 1,000, and n_class is 10. My implementation may be slightly different from the original equation.

Thank you.

The GPU code could be faster if you vectorize (or batch) some operations. For loops are generally slow on the GPU because each iteration launches a few CUDA kernels. Because your n_class and n_features are large, this means many CUDA kernels get launched, and each launch has overhead.

        k = 0
        for g in range(n_class):
            i = C[g]
            u = X[i][:,j].contiguous().view(-1,X[i][:,j].size()[0])
            k += (1 / i.size()[0]) * torch.sum(torch.mm(torch.t(u),u))
        k *= (1 / n)

I'm not entirely sure what you're computing here, but if you can replace the for loops purely with operations on the Variable, it will run much faster on the GPU.
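For example, here is a rough sketch of how the whole score could be computed without the per-feature loop, under the same setup as your code (X a Variable of shape (n, n_features) on the GPU, y a Variable of labels; fisher_vectorized is just an illustrative name). It relies on the identity that the sum of the outer product of a vector with itself equals the square of its sum, so each torch.mm call collapses into an elementwise square:

import torch
from torch.autograd import Variable

def fisher_vectorized(X, y, n_class=10, gamma=-0.5, beta=0):
    n = X.size()[0]

    # parts 2 and 3 for every column at once
    col_sum = torch.sum(X, 0)                      # per-feature sum over all samples
    l = (gamma / n) * torch.sum(X * X, 0)          # per-feature sum of squares
    m = ((gamma - 1) / (n * n)) * col_sum * col_sum

    # part 1: sum_{i,i' in C_g} x_ij * x_i'j == (sum_{i in C_g} x_ij)^2,
    # so the outer products aren't needed; only a short loop over classes remains
    k = 0
    for g in range(n_class):
        rows = [i for i, c in enumerate(y.data) if int(c) == g]
        idx = Variable(torch.cuda.LongTensor(rows))
        s = torch.sum(X.index_select(0, idx), 0)   # per-feature sum within class g
        k = k + (1.0 / len(rows)) * s * s
    k = k / n

    theta = k - l + m
    # sort the underlying tensor directly; indices of the top-beta features by score
    return torch.sort(theta.data, 0, descending=True)[1][:beta]

This way only n_class iterations (10 in your case) launch kernels, instead of n_features * n_class, and each kernel operates on all 20,000 columns at once.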