Batch-wise processing is quite slow

Suppose I have three vectors A, B, and C:

A: vector of size 256
B: vector of size 256
C: vector of size 256

Now I want to concatenate them pairwise:

AB: vector of size 512
AC: vector of size 512
BC: vector of size 512

However, I need each combined vector to stay at size 256:

AB: vector of size 256
AC: vector of size 256
BC: vector of size 256

One way is to take the element-wise mean of the two vectors: the mean of the first element of A and the first element of B, the mean of the second element of A and the second element of B, and so on. The other pairs are combined the same way.
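
To make the difference concrete, here is a minimal sketch with two stand-in vectors of size 4 instead of 256. The names a, b, ab_cat, and ab_mean are only for illustration and do not appear in my actual code.

```python
import torch

# Two small stand-in vectors (size 4 instead of 256, for readability).
a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([3.0, 4.0, 5.0, 6.0])

# Plain concatenation doubles the size.
ab_cat = torch.cat((a, b))   # shape: [8]

# The element-wise mean keeps the original size.
ab_mean = (a + b) / 2        # shape: [4]
```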

Here is how I implement this:

import torch

# x: torch.Size([32, 3, 256])
# 32 is the batch size, 3 is the number of vectors (A, B, C), 256 is the dimension of each vector

def my_fun(self, x):
    # x: [n, dim] for a single sample.
    # Returns the n input vectors followed by the n*(n-1)/2 pairwise means.
    n = x.shape[0]
    num_pairs = n * (n - 1) // 2
    new_x = torch.zeros((num_pairs, x.shape[1]), dtype=torch.float32, device=x.device)
    counter = 0
    for i in range(n - 1):
        # j starts at i + 1 so a vector is never paired with itself
        for j in range(i + 1, n):
            new_x[counter, :] = (x[i, :] + x[j, :]) / 2
            counter += 1
    final_T = torch.cat((x, new_x), dim=0)
    return final_T

# Output rows per sample: the n originals plus the n*(n-1)/2 pairwise means
n = x.shape[1]
ref = torch.zeros((x.shape[0], n + n * (n - 1) // 2, x.shape[2]), dtype=torch.float32, device=x.device)
for i in range(x.shape[0]):
    ref[i, :, :] = self.my_fun(x[i, :, :])

But this implementation is computationally expensive. One reason is that I iterate over the batch one sample at a time, and within each sample over every pair in Python loops. Is there a more efficient way to implement this?
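
For reference, this is the kind of loop-free version I have in mind, as a rough sketch only. It assumes `torch.triu_indices` is a reasonable way to enumerate the pairs (A,B), (A,C), (B,C), and the function name `pairwise_means` is hypothetical. I am not sure it is the best approach.

```python
import torch

def pairwise_means(x):
    # x: [batch, n, dim]
    n = x.shape[1]
    # Index pairs (i, j) with i < j, in row-major order: (0,1), (0,2), (1,2) for n = 3.
    idx = torch.triu_indices(n, n, offset=1, device=x.device)  # shape: [2, n*(n-1)/2]
    # Gather both members of every pair for the whole batch and average them.
    means = (x[:, idx[0]] + x[:, idx[1]]) / 2                  # shape: [batch, n*(n-1)/2, dim]
    # Keep the original vectors first, then append the pairwise means.
    return torch.cat((x, means), dim=1)
```

With x of shape [32, 3, 256], this would return a tensor of shape [32, 6, 256] without any Python loop over the batch or over the pairs.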