How to speed up the following operation (for loop creating different-sized matrices)

I’ve got the following mechanism for adding positional awareness to representations. It works well but is very slow.

The fundamental issue is that each sample in the batch has a different real length, so I have to loop over the batch and create a different embedding for each sample.

import torch
import torch.nn as nn


class PositionSymmetricLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Fixed (non-trainable) anchor points in [0, 1], one per half of the embedding.
        self.embedding = nn.Parameter(torch.linspace(0., 1., dim // 2), requires_grad=False)
        self.dim = dim

    def forward(self, x, durs):
        # x: (N, T, feat), padded to max length T; durs: real length of each sample.
        N = x.size(0)
        T = x.size(1)
        pos_emb = torch.zeros((N, T, self.dim), dtype=torch.float32).cuda()
        for i, dur in enumerate(durs):
            # Relative position of each step within this sample, in [0, 1].
            factor = torch.linspace(0, 1, dur).unsqueeze(1).cuda()
            pos = (torch.cat((self.embedding - factor, -self.embedding + factor), dim=-1) >= 0).float()
            # Samples are right-aligned, so only the last `dur` time steps are filled.
            pos_emb[i, -dur:] = pos
        x = torch.cat((x, pos_emb), dim=-1)
        return x

To help you understand what this does, here is what pos looks like for a dur of 3 and a dim of 8:

tensor([[1., 1., 1., 1., 1., 0., 0., 0.],
        [0., 0., 1., 1., 1., 1., 0., 0.],
        [0., 0., 0., 1., 1., 1., 1., 1.]])
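
For context, the layer gets called roughly like this (the concrete shapes and values below are just an illustration, not my real data):

layer = PositionSymmetricLayer(dim=8).cuda()
x = torch.randn(4, 10, 16).cuda()   # batch of 4, padded to length 10, 16 features
durs = [10, 7, 3, 5]                # real length of each sample (right-aligned)
out = layer(x, durs)                # -> shape (4, 10, 16 + 8)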

Is there something I can do to speed the for loop up?

Update: I got a large reduction in runtime by sorting the samples by duration and batchifying some of the operations.
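
Roughly, it looks something like this. This is a simplified sketch, assuming durs is a plain list of ints; I group the samples by duration so the Python loop runs once per distinct length instead of once per sample, and the class name is just for illustration:

class PositionSymmetricLayerGrouped(nn.Module):
    # Hypothetical variant of the layer above; the grouping via unique() stands in
    # for the sort-and-batch step.
    def __init__(self, dim):
        super().__init__()
        self.embedding = nn.Parameter(torch.linspace(0., 1., dim // 2), requires_grad=False)
        self.dim = dim

    def forward(self, x, durs):
        N, T = x.size(0), x.size(1)
        pos_emb = torch.zeros((N, T, self.dim), dtype=torch.float32, device=x.device)
        durs_t = torch.as_tensor(durs, device=x.device)
        # One iteration per unique duration: the block is computed once and
        # broadcast to every sample that shares that duration.
        for dur in durs_t.unique().tolist():
            idx = (durs_t == dur).nonzero(as_tuple=True)[0]
            factor = torch.linspace(0, 1, dur, device=x.device).unsqueeze(1)
            pos = (torch.cat((self.embedding - factor,
                              -self.embedding + factor), dim=-1) >= 0).float()
            pos_emb[idx, -dur:] = pos
        return torch.cat((x, pos_emb), dim=-1)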