Slow loop. Better method?

I’ve defined a custom loss function for a regression problem, but it contains a loop that is very slow, and I was wondering whether there is a better way to do this.

I generate a prediction and then I want to find the vector in a table (i.e. an embedding layer) that is closest to the residual. I use a loop for this, but it is very slow. I wanted to use torch.Tensor.apply_, but that is CPU-only, not GPU. I discovered torch.func.vmap, but I’m not sure how to use it over both the table and a minibatch (which seems to be its primary purpose). I should also mention that I want to update the table during training, so perhaps I need to use requires_grad = True in torch.arange?

Thank you in advance for any advice!

def lossFunction(self, prediction, label):
        residual = (label - prediction)
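        # scan all centroids, tracking the one whose vector is nearest the residual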
        for i in torch.arange(self.nCentroids, device='cuda'):
            thisNorm = torch.linalg.vector_norm(residual - self.embedding(i))
            if i == 0:
                bestNorm = thisNorm
                bestIndex = 0
            elif thisNorm < bestNorm:
                bestNorm = thisNorm
                bestIndex = i
        thisNorm = torch.linalg.vector_norm(prediction - self.embedding(i))
        return thisNorm

Hi Mark!

If you can package your embedding-layer table as a pytorch tensor, you can find the
nearest vector with pytorch tensor operations, avoiding an explicit python loop.

Here, self.embedding looks like a function that takes an index (packaged as a
zero-dimensional pytorch tensor). Let me assume that embedding is or can be packaged
as a tensor. In particular, let prediction and residual be vectors of length n and
embedding be a tensor of shape [nCentroids, n], with one centroid per row. You can then use a
single call to
argmin() instead of the explicit loop.

The last line of your loss function appears to contain a typo – I assume that it should be:

        thisNorm = torch.linalg.vector_norm (prediction - self.embedding (bestIndex))

Note, we will use broadcasting to compare the residual vector with each of the rows
of the embedding tensor. That is, (residual - embedding).shape is [nCentroids, n].

Here is a loop-free script:

import torch
print (torch.__version__)

torch.manual_seed (2025)

n = 3
nCentroids = 5

prediction = torch.randn (n)
residual = torch.randn (n)
embedding = torch.randn (nCentroids, n)

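# find the index of the centroid nearest the residual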
bestIndex = torch.linalg.vector_norm (residual - embedding, dim = 1).argmin()
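# norm of the prediction against that nearest centroid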
thisNorm = torch.linalg.vector_norm (prediction - embedding[bestIndex])

print ('bestIndex:', bestIndex)
print ('thisNorm:', thisNorm)

And here is its output:

2.6.0+cu126
bestIndex: tensor(3)
thisNorm: tensor(1.4950)

Best.

K. Frank

Thank you KFrank! I have two questions.

In order to make the table trainable, it must be an nn.Parameter rather than a torch.Tensor, right?

So I created self.table as an nn.Parameter, but in this line:

bestIndex = torch.linalg.vector_norm (residual - self.table, dim = 1).argmin()

I get this error (batch size is 32 and the table is 65536 x 19):

RuntimeError: The size of tensor a (32) must match the size of tensor b (65536) at non-singleton dimension 0

I could fix the batch size and hard-code the broadcast, but I think perhaps I’m missing something important?

Hi Mark!

Yes, making the table an nn.Parameter would be the typical way to do it. (There would be a
connotation that self.embedding is naturally part of your model, but this is not a requirement.)
But you can also train a tensor that carries requires_grad = True without it being a Parameter.
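
For example, here is a minimal sketch (the names table and opt are mine, not from your
code) of training a plain tensor by handing it directly to an optimizer:

import torch

# a plain tensor with requires_grad = True -- not an nn.Parameter
table = torch.randn (65536, 19, requires_grad = True)
opt = torch.optim.SGD ([table], lr = 0.1)   # optimizers accept plain leaf tensors

loss = table.pow (2).sum()   # stand-in for a real loss that depends on table
loss.backward()
opt.step()   # updates table in place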

Take a look at the documentation for linalg.vector_norm() and broadcasting semantics,
in particular how the dim argument works for vector_norm(). Print out the shapes of
residual and self.table and see if everything lines up correctly.
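
For concreteness, here is a minimal sketch, assuming, as you describe, a batch of 32
residuals of length 19 and a table of shape [65536, 19]. Inserting a singleton dimension
into residual lets broadcasting produce one difference vector per (batch element, centroid)
pair, and dim = 2 tells vector_norm() to take the norm over the length-19 dimension:

import torch

residual = torch.randn (32, 19)   # a batch of residual vectors
table = torch.randn (65536, 19)   # one centroid per row

diff = residual.unsqueeze (1) - table   # broadcasts to shape [32, 65536, 19]
bestIndex = torch.linalg.vector_norm (diff, dim = 2).argmin (dim = 1)

print (diff.shape)        # torch.Size([32, 65536, 19])
print (bestIndex.shape)   # torch.Size([32]) -- one nearest centroid per batch element

(torch.cdist (residual, table) computes the same [32, 65536] matrix of pairwise distances
without materializing diff explicitly.)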

Best.

K. Frank