Hello everyone,

I am implementing a custom autograd function (and its corresponding module) that runs as part of the sequential execution in a layer. Training with it is very slow. I even commented out the computations and left only the loops and the calls to the custom function, and it is still extremely slow. The looping could be the main cause, but that seems unlikely to me, since the inputs to the two layers only have sizes [20, 1, 28, 28] and [20, 32, 12, 12]. The most relevant parts of the code are the following:

```
import torch
import torch.nn as nn
from torch.autograd import Function


class FrankWolfe_MinCutSum_Canonical(Function):
    @staticmethod
    def in_bounds(ctx, coord):
        (x, y) = coord
        return 0 <= x < ctx.width and 0 <= y < ctx.height

    @staticmethod
    def neighbors(ctx, coord):
        # 4-connected neighbors of coord that lie inside the grid
        (x, y) = coord
        return list(filter(lambda p: FrankWolfe_MinCutSum_Canonical.in_bounds(ctx, p), [(x+1, y), (x, y-1), (x-1, y), (x, y+1)]))

    @staticmethod
    def relaxed_taylor_closed_form_solution(ctx, w):
        # clone so that beta itself is not modified in place
        u_star = ctx.beta.clone()
        for x in range(ctx.width):
            for y in range(ctx.height):
                i = (x, y)
                u_star[i] -= ctx.alpha*sum([(1 if w[i] > w[j] else -1) for j in FrankWolfe_MinCutSum_Canonical.neighbors(ctx, i)])
        return u_star

    @staticmethod
    def forward(ctx, beta, alpha=1, tol=1e-6, max_iter=20):
        # print(beta)
        ctx.height, ctx.width = list(beta.shape)
        ctx.alpha, ctx.beta = alpha, beta
        v = beta
        # for t in range(max_iter):
        #     w = v.exp().div(v.exp().add(1))
        #     u_star = FrankWolfe_MinCutSum_Canonical.relaxed_taylor_closed_form_solution(ctx, w)
        #     gamma_t = 2 / (t + 2)
        #     v_1 = v
        #     v = (1 - gamma_t)*v_1 + gamma_t*u_star
        #     if torch.norm(v - v_1) < tol:
        #         break
        ctx.save_for_backward(v.clone())
        return v

    @staticmethod
    def backward(ctx, grad_output):
        # saved_tensors replaces the deprecated saved_variables
        v, = ctx.saved_tensors
        # return v * grad_output
        return grad_output


class LateralInteractions(nn.Module):
    def __init__(self):
        super(LateralInteractions, self).__init__()

    def forward(self, x):
        # out aliases x, so the assignments below write into x in place
        out = x
        batch_size, channels, _, _ = list(x.shape)
        # one call to the custom function per (batch, channel) slice
        for b in range(batch_size):
            for c in range(channels):
                # continue
                # out[b, c] = FrankWolfe_MinCutSum_Canonical.apply(Variable(x[b, c], requires_grad=True))
                out[b, c] = FrankWolfe_MinCutSum_Canonical.apply(x[b, c])
        return out
```
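As an aside, here is a minimal sketch of how the per-pixel neighbor loop in `relaxed_taylor_closed_form_solution` could be written with whole-tensor operations, so that it could run on a full [batch, channels, H, W] tensor at once. This is only an idea under stated assumptions, not something I have wired into the network: the names `neighbor_sign_sum` and `relaxed_taylor_batched` are hypothetical, the last two dimensions are assumed to be the pixel grid, and out-of-bounds neighbors are skipped as in the loop version.

```python
import torch


def neighbor_sign_sum(w):
    """For every pixel i, sum over its in-bounds 4-neighbors j of
    (+1 if w[i] > w[j] else -1). Works on any tensor shaped (..., H, W)."""
    s = torch.zeros_like(w)

    def sgn(a, b):
        # +1 where a > b, otherwise -1 (ties give -1, as in the loop version)
        return torch.where(a > b, torch.ones_like(a), -torch.ones_like(a))

    # Compare each pixel with its neighbor one step further along the rows
    s[..., :-1, :] += sgn(w[..., :-1, :], w[..., 1:, :])
    # ... and with its neighbor one step back along the rows
    s[..., 1:, :] += sgn(w[..., 1:, :], w[..., :-1, :])
    # Same for the two neighbors along the columns
    s[..., :, :-1] += sgn(w[..., :, :-1], w[..., :, 1:])
    s[..., :, 1:] += sgn(w[..., :, 1:], w[..., :, :-1])
    return s


def relaxed_taylor_batched(beta, w, alpha=1.0):
    # Hypothetical batched counterpart of the loop: beta - alpha * neighbor sum
    return beta - alpha * neighbor_sign_sum(w)
```

With something like this, the module would not need to loop over `b` and `c` at all, assuming the rest of the iteration can also be expressed on the full tensor.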

I left the commented-out lines in `FrankWolfe_MinCutSum_Canonical` and `LateralInteractions` so that you can see that, right now, the function simply passes its input through as its output without changing anything. The network then looks as follows:

```
class ConvNetLat(nn.Module):
    def __init__(self, n=10):
        super(ConvNetLat, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7, stride=2, padding=1),
            LateralInteractions())
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, n, kernel_size=7, stride=1, padding=1),
            LateralInteractions(),
            nn.AdaptiveAvgPool2d(1))
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        # [20, 1, 28, 28]
        out = self.layer1(x)
        # [20, 32, 12, 12]
        out = self.layer2(out)
        # [20, 10, 1, 1]
        out = out.reshape(out.size(0), -1)
        out = self.log_softmax(out)
        return out
```
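As a sanity check on the shapes in the comments above, here is a small sketch that works out what each `LateralInteractions` module actually sees and how many calls to the custom function one forward pass makes. It uses the standard convolution output-size formula; `conv_out` is a hypothetical helper name, and the batch size of 20 is taken from the shapes above.

```python
def conv_out(size, kernel, stride, padding):
    # Standard output-size formula for a convolution without dilation
    return (size + 2 * padding - kernel) // stride + 1


batch = 20
n_classes = 10

# layer1: Conv2d(1, 32, kernel_size=7, stride=2, padding=1) on 28x28 input
side1 = conv_out(28, kernel=7, stride=2, padding=1)      # 12
# layer2: Conv2d(32, n, kernel_size=7, stride=1, padding=1) on 12x12 input
side2 = conv_out(side1, kernel=7, stride=1, padding=1)   # 8

# One .apply() call per (batch, channel) slice in each LateralInteractions
calls_layer1 = batch * 32          # 640
calls_layer2 = batch * n_classes   # 200
calls_per_forward = calls_layer1 + calls_layer2  # 840

print(side1, side2, calls_per_forward)
```

So the first `LateralInteractions` works on [20, 32, 12, 12] and the second on [20, 10, 8, 8], which adds up to 840 calls to the custom function (and 840 in-place slice assignments) per forward pass.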

What could I do to improve the training performance? As is (with the computations commented out), training is literally taking hours.
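To narrow down whether the loop of per-slice calls is really the cause, a minimal timing sketch along these lines might help. It is not my actual code: `IdentityFn`, `per_slice`, `whole_tensor`, and `time_forward_backward` are hypothetical stand-ins, the function is a plain pass-through like the commented-out version above, and the per-slice variant stacks results instead of writing into the input in place.

```python
import time

import torch
from torch.autograd import Function


class IdentityFn(Function):
    """Pass-through custom function, standing in for the commented-out one."""

    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


def per_slice(x):
    # One apply() per (batch, channel) slice, collected with torch.stack
    rows = []
    for b in range(x.shape[0]):
        rows.append(torch.stack(
            [IdentityFn.apply(x[b, c]) for c in range(x.shape[1])]))
    return torch.stack(rows)


def whole_tensor(x):
    # A single apply() on the full 4-D tensor
    return IdentityFn.apply(x)


def time_forward_backward(fn, shape, repeats=3):
    # Best wall-clock time of one forward + backward pass through fn
    best = float("inf")
    for _ in range(repeats):
        x = torch.randn(*shape, requires_grad=True)
        start = time.perf_counter()
        fn(x).sum().backward()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    shape = (20, 32, 12, 12)
    print("per-slice   :", time_forward_backward(per_slice, shape))
    print("whole tensor:", time_forward_backward(whole_tensor, shape))
```

Comparing the two timings for one batch should show how much of the cost comes purely from making hundreds of small autograd calls, independent of the Frank-Wolfe computation itself.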

Any help will be greatly appreciated.