Slow custom cost function

I implemented a new cost function. It works, but the full-batch loss time exploded: from ~10 ms with nn.NLLLoss() to ~900 ms on the MNIST dataset (10 classes).
System: torch 2.5.1, CUDA 12, Python 3.12.3 on Ubuntu 24.04.
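
For anyone who wants to reproduce the timing, a minimal sketch (simplified, not my exact harness; the 60_000 x 10 shapes are just a stand-in for full-batch MNIST logits):

import time
import torch as tt

x = tt.randn(60_000, 10, device="cuda")          # stand-in for full-batch MNIST logits
y = tt.randint(0, 10, (60_000,), device="cuda")  # random labels

tt.cuda.synchronize()                            # flush pending work before timing
t0 = time.perf_counter()
loss = cos_loss(x, y)                            # cos_loss as defined below
tt.cuda.synchronize()                            # CUDA is async: wait before reading the clock
print(f"{(time.perf_counter() - t0) * 1e3:.1f} ms, loss = {loss.item():.4f}")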
Algorithm idea (for each (x, y) entry):

  1. make the x tensor mean-free by subtracting an offset m (m = mean over the non-target entries only),
  2. compute norm(x - m),
  3. compute a final cosine-similarity-like value.
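
Written out (B = batch size, C = 10 classes, ε = 1e-6 as in the code; the scalar m_i is subtracted from every entry of row x_i):

$$
m_i = \frac{1}{C-1}\sum_{j \ne y_i} x_{i,j}, \qquad
\mathcal{L} = 1 - \frac{1}{B}\sum_{i=1}^{B}\frac{x_{i,y_i} - m_i}{\lVert x_i - m_i \rVert_2 + \varepsilon}
$$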
    (I marked the critical lines with !!! in the code below.)
import torch as tt

def cos_loss(x: tt.Tensor, y: tt.Tensor) -> tt.Tensor:  # x.dim()==2, y.dim()==1
    "custom loss: shifted cosine (by A. Kleinsorge + A. Fauck)"
    # 1 - cos(x - m, yy), |yy| = 1, e.g. yy = (1, 0, 0, ...); cf. nn.CosineSimilarity
    bs: int = y.numel()  # batch size; classes == 10 is hardcoded below
    # loop version: for i in range(bs): xy[i] = x[i][y[i]]  # pick the x[y] entries
    xy: tt.Tensor = x[tt.arange(bs), y.int()]  # x.index_select(?)  !!!!
    m: tt.Tensor = (x.sum(dim=1) - xy) * (1.0 / (10 - 1))  # mean over all non-target entries
    xmn: tt.Tensor = tt.zeros(bs, device=y.device)  # output buffer for the next line
    for i, (x1, m1) in enumerate(zip(x, m)): xmn[i] = (x1 - m1).norm()  # per-row (x - m).norm()  !!!!
    return 1.0 - ((xy - m) / (xmn + 1e-6)).mean()
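
My own guess: the Python loop over the batch launches one tiny kernel plus a scalar write per sample, so a full MNIST batch pays roughly 60k launches. A vectorized sketch of what I believe is an equivalent version (untested for exact numerical parity; the name cos_loss_fast and the generalization of the hardcoded 10 to x.shape[1] are mine):

import torch as tt

def cos_loss_fast(x: tt.Tensor, y: tt.Tensor) -> tt.Tensor:
    "vectorized cos_loss sketch: same math, no per-sample Python loop"
    bs: int = y.numel()
    idx = tt.arange(bs, device=x.device)        # indices on x's device (no H2D copy per call)
    xy = x[idx, y.long()]                       # target entry per row, shape (bs,)
    m = (x.sum(dim=1) - xy) / (x.shape[1] - 1)  # mean over the non-target entries
    xmn = (x - m.unsqueeze(1)).norm(dim=1)      # batched row norms, replaces the loop
    return 1.0 - ((xy - m) / (xmn + 1e-6)).mean()

Here norm(dim=1) computes all row norms in a single kernel, which is where I expect most of the 900 ms to go.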

Any ideas on how to speed this up further, or on whether the vectorized sketch is the right direction?
This algorithm has some nice mathematical properties, but that's another story.
Thanks for reading.
Alex