Speed up custom function and model using GPU

Hello everyone,

I am implementing a custom autograd Function (and its corresponding module) that runs as part of the sequential execution in a layer. When running it, training is very slow. I even commented out the computations and left only the loops and the calls to the custom function, and it is still extremely slow. The looping could be the main cause, but that seems unlikely, since the tensors are only of size [20, 1, 28, 28] and [20, 32, 12, 12] at the two layers. The most relevant parts of the code are the following:

import torch
import torch.nn as nn
from torch.autograd import Function, Variable

class FrankWolfe_MinCutSum_Canonical(Function):
    @staticmethod
    def in_bounds(ctx, coord):
        (x, y) = coord
        # (x, y) indexes (row, col), so x is bounded by the height and y by the width
        return 0 <= x < ctx.height and 0 <= y < ctx.width

    @staticmethod
    def neighbors(ctx, coord):
        (x, y) = coord
        return list(filter(lambda p: FrankWolfe_MinCutSum_Canonical.in_bounds(ctx, p), [(x+1, y), (x, y-1), (x-1, y), (x, y+1)]))

    @staticmethod
    def relaxed_taylor_closed_form_solution(ctx, w):
        # Clone so the in-place subtraction does not modify the saved input
        u_star = ctx.beta.clone()
        # x runs over rows (height), y over columns (width)
        for x in range(ctx.height):
            for y in range(ctx.width):
                i = (x, y)
                u_star[i] -= ctx.alpha*sum([(1 if w[i] > w[j] else -1) for j in FrankWolfe_MinCutSum_Canonical.neighbors(ctx, i)])
        return u_star

    @staticmethod
    def forward(ctx, beta, alpha=1, tol=1e-6, max_iter=20):
        # print(beta)
        ctx.height, ctx.width = beta.shape
        ctx.alpha, ctx.beta = alpha, beta
        v = beta
        # for t in range(max_iter):
        #     w = v.exp().div(v.exp().add(1))
        #     u_star = FrankWolfe_MinCutSum_Canonical.relaxed_taylor_closed_form_solution(ctx, w)
        #     gamma_t = 2 / (t + 2)
        #     v_1 = v
        #     v = (1 - gamma_t)*v_1 + gamma_t*u_star
        #     if torch.norm(v - v_1) < tol:
        #         break
        ctx.save_for_backward(v.clone())
        return v

    @staticmethod
    def backward(ctx, grad_output):
        v, = ctx.saved_tensors  # saved_variables is deprecated in favor of saved_tensors
        # return v * grad_output
        return grad_output

class LateralInteractions(nn.Module):
    def __init__(self):
        super(LateralInteractions, self).__init__()

    def forward(self, x):
        out = x  # note: out aliases x, so the loop below writes into the input in place
        batch_size, channels, _, _ = x.shape
        for b in range(batch_size):
            for c in range(channels):
                # continue
                # out[b, c] = FrankWolfe_MinCutSum_Canonical.apply(Variable(x[b, c], requires_grad=True))
                out[b, c] = FrankWolfe_MinCutSum_Canonical.apply(x[b, c])
        return out

I left the comments in so you can see that, right now, the function basically passes the input through unchanged. The network then looks as follows:

class ConvNetLat(nn.Module):
    def __init__(self, n=10):
        super(ConvNetLat, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7, stride=2, padding=1),
            LateralInteractions())
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, n, kernel_size=7, stride=1, padding=1),
            LateralInteractions(),
            nn.AdaptiveAvgPool2d(1))
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        # [20, 1, 28, 28]
        out = self.layer1(x)
        # [20, 32, 12, 12]
        out = self.layer2(out)
        # [20, 10, 1, 1]
        out = out.reshape(out.size(0), -1)
        out = self.log_softmax(out)
        return out

What could I do to improve the training speed? Even as is (with the computations commented out and everything), this literally takes hours to train.

Any help will be greatly appreciated.

Most probably it’s the for loops. For better performance it’s always better to use vectorized operations. So for this line:

    u_star[i] -= ctx.alpha*sum([(1 if w[i] > w[j] else -1) for j in FrankWolfe_MinCutSum_Canonical.neighbors(ctx, i)])

if you can write it as a vector operation without a for loop, it should be much faster.
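
For illustration, a rough sketch of what that could look like with torch.roll (the helper name is just for illustration, and torch.roll wraps around at the borders, so the edge handling differs from the in_bounds check above; treat it as an illustration, not a drop-in replacement):

import torch

def neighbor_sign_sum(w, alpha):
    # Compare w against its four shifted copies in one shot instead of
    # looping over coordinates
    total = torch.zeros_like(w)
    for shift, dim in [(1, 0), (-1, 0), (1, 1), (-1, 1)]:
        w_nb = torch.roll(w, shifts=shift, dims=dim)
        # (w > w_nb) * 2 - 1 reproduces the (1 if ... else -1) comparison
        total += (w > w_nb).to(w.dtype) * 2 - 1
    return alpha * total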

Done. Changed the following:

    def relaxed_taylor_closed_form_solution(ctx, w):
        # Clone so the saved input is not modified in place
        u_star = ctx.beta.clone()
        # x runs over rows (height), y over columns (width), matching beta's shape
        u_star -= (torch.cuda if torch.cuda.is_available() else torch).FloatTensor(
            [[ctx.alpha*sum([(1 if w[(x, y)] > w[j] else -1) for j in FrankWolfe_MinCutSum_Canonical.neighbors(ctx, (x, y))])
              for y in range(ctx.width)] for x in range(ctx.height)])
        return u_star

and

class LateralInteractions(nn.Module):
    def __init__(self):
        super(LateralInteractions, self).__init__()

    def forward(self, x):
        # out = x
        batch_size, channels, _, _ = x.shape
        # Variable is deprecated; x[b, c] can be passed to apply directly
        out = torch.stack([
                torch.stack([FrankWolfe_MinCutSum_Canonical.apply(Variable(x[b, c], requires_grad=True))
                for c in range(channels)]) for b in range(batch_size)])
        # out[b, c] = FrankWolfe_MinCutSum_Canonical.apply(x[b, c])
        return out

Still super slow, though. Any other recommendations? Maybe running it in parallel would improve it enough?

You are still running for loops inside those stack calls. I think you can do it in one go over b and c and then do a single stack. Those for loops seem to be the bottleneck.
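
Something along these lines, maybe (a sketch, assuming the Function still operates on one 2D slice at a time):

def forward(self, x):
    batch_size, channels, height, width = x.shape
    # Fold batch and channel into one dimension: a single flat loop and a
    # single stack instead of two nested comprehensions
    flat = x.reshape(batch_size * channels, height, width)
    out = torch.stack([FrankWolfe_MinCutSum_Canonical.apply(flat[i])
                       for i in range(batch_size * channels)])
    return out.reshape(batch_size, channels, height, width)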

Is this the correct way?

    def relaxed_taylor_closed_form_solution(ctx, w):
        # Clone so the saved input is not modified in place
        u_star = ctx.beta.clone()
        # x runs over rows (height), y over columns (width), so the
        # reshape has to be (height, width) to match beta's layout
        u_star -= (torch.cuda if torch.cuda.is_available() else torch).FloatTensor(
            [ctx.alpha*sum([(1 if w[(x, y)] > w[j] else -1) for j in FrankWolfe_MinCutSum_Canonical.neighbors(ctx, (x, y))])
             for x in range(ctx.height) for y in range(ctx.width)]).reshape(ctx.height, ctx.width)
        return u_star

class LateralInteractions(nn.Module):
    def __init__(self):
        super(LateralInteractions, self).__init__()

    def forward(self, x):
        # out = x
        batch_size, channels, height, width = x.shape
        out = torch.stack([FrankWolfe_MinCutSum_Canonical.apply(Variable(x[b, c], requires_grad=True))
                for b in range(batch_size) for c in range(channels)]).reshape(batch_size, channels, height, width)
        # out[b, c] = FrankWolfe_MinCutSum_Canonical.apply(x[b, c])
        return out

Everything else is as in the original post.

After digging into it more, I have come to the conclusion that the comparison of scalar tensors is what takes a huge amount of time. Is there any way to make this faster?

        # device is assumed to be defined elsewhere, e.g.
        # device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        return ctx.beta - torch.FloatTensor(
                [ctx.alpha*sum([(1 if w[(x, y)] > w[j] else -1) for j in FrankWolfe_MinCutSum_Canonical.neighbors(ctx, (x, y))])
                for x in range(ctx.height)
                for y in range(ctx.width)]).reshape(ctx.height, ctx.width).to(device)
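
A quick way to sanity-check this is a micro-benchmark along these lines (a sketch; the absolute numbers depend on the machine). Per-pair scalar comparisons cross the Python boundary once per element, while one element-wise comparison runs as a single vectorized op:

import time
import torch

w = torch.rand(28, 28)

# Per-element scalar comparisons: one Python-level branch per pair
start = time.time()
for _ in range(100):
    s = sum((1 if w[x, y] > w[x, (y + 1) % 28] else -1)
            for x in range(28) for y in range(28))
print("scalar comparisons:", time.time() - start)

# One element-wise comparison over the whole tensor
start = time.time()
for _ in range(100):
    s = ((w > torch.roll(w, -1, dims=1)).float() * 2 - 1).sum()
print("vectorized:", time.time() - start)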

One thing: why are you not doing all the ops on the GPU? That might give you some speedup.
And again, try to remove the for loops and vectorize the operations. It will require some tricks, like unrolling the matrix: for each neighbor direction, build a shifted copy of the whole tensor and compare the full matrices at once instead of looping over single entries.

I changed the function into this:

    def relaxed_taylor_closed_form_solution(ctx, w):
        # Shifted copies of w with replicated edges: each cell is compared
        # against its four neighbors in one tensor op, and border cells are
        # compared against themselves, so sign(0) = 0 there (an out-of-bounds
        # neighbor contributes nothing). Note that torch.sign is also 0 on
        # interior ties, where the original comparison returned -1.
        w_shift_left = torch.cat((w[:, 1:], w[:, -1].reshape(-1, 1)), 1)
        w_shift_right = torch.cat((w[:, 0].reshape(-1, 1), w[:, :-1]), 1)
        w_shift_up = torch.cat((w[1:, :], w[-1, :].reshape(1, -1)), 0)
        w_shift_down = torch.cat((w[0, :].reshape(1, -1), w[:-1, :]), 0)
        interactions = ctx.alpha*(torch.sign(w - w_shift_left) + torch.sign(w - w_shift_right)
                                  + torch.sign(w - w_shift_up) + torch.sign(w - w_shift_down))
        return ctx.beta - interactions

The execution was still too slow, so I started investigating parallelization. Currently I have implemented the following, which works well on CPU but crashes on GPU:

from joblib import Parallel, delayed
import multiprocessing

class LateralInteractions(nn.Module):
    def __init__(self):
        super(LateralInteractions, self).__init__()

    def forward(self, x):
        batch_size, channels, height, width = x.shape
        num_cores = multiprocessing.cpu_count()
        results = Parallel(n_jobs=num_cores, backend="threading")(
            delayed(FrankWolfe_MinCutSum_Canonical.apply)(x[b, c])
            for b in range(batch_size) for c in range(channels))
        out = torch.stack(results).reshape(batch_size, channels, height, width)
        return out

To give you a general idea: execution is faster with the parallelized CPU version than on the GPU without parallelization. It would be ideal to have that part of the code called in parallel on the GPU as well, since the calls are completely independent and only need to synchronize before passing to the next layer.
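
For completeness: since the per-slice computation is now just shifts and signs, the same trick could be applied to the whole [batch, channels, height, width] tensor at once, which would remove the Python-level loop entirely. A sketch, assuming a single application of the interaction term (w = v.exp()/(v.exp() + 1) from the original forward is just torch.sigmoid(v)):

def relaxed_taylor_batched(beta, w, alpha=1):
    # beta, w: [batch, channels, height, width]. Edges are replicated so
    # border cells compare against themselves (sign(0) = 0), mirroring
    # the 2D torch.cat shifts above.
    w_left = torch.cat((w[..., 1:], w[..., -1:]), dim=-1)
    w_right = torch.cat((w[..., :1], w[..., :-1]), dim=-1)
    w_up = torch.cat((w[..., 1:, :], w[..., -1:, :]), dim=-2)
    w_down = torch.cat((w[..., :1, :], w[..., :-1, :]), dim=-2)
    interactions = alpha * (torch.sign(w - w_left) + torch.sign(w - w_right)
                            + torch.sign(w - w_up) + torch.sign(w - w_down))
    return beta - interactions

With this, the forward of LateralInteractions would reduce to a single call on the full tensor, and the GPU processes all (b, c) slices together, with no joblib needed.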


Thanks a lot! This parallelization method really helps me!