Help me to speed up nested loop?

I am using a nested for loop to get outputs from my model for 1000 images compared with themselves. The model takes 2 images and outputs a scalar. A is a tensor of size (1000,3,128,128) and B=A. But it is taking a long time to get the output.

out = []
for i in A:
    temp = []
    for j in B:
        temp.append(model(i, j))
    out.append(temp)

Final shape of out is (1000,1000). The above code just gives the overall idea of my task, and 1000 is arbitrary.
I am actually using batches of size 8 instead of 1 (as above) to speed up the process, but this is still taking a lot of time. Can anyone help me with this? I also want to know if there is a way to avoid duplicate comparisons, since the out matrix is symmetric.

Could you explain the comparison inside your model a bit?
I.e. if you pass two batches of indices [0, 1, 2] and [0, 1, 2], will the model output the scalars for 0-0, 1-1, 2-2, or will it also compute all other combinations?

In the former case, you could increase the inner batch (j) by one while increasing the outer batch by batch_size.

You could add a check for the indices as:

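for index_i, i in enumerate(A):
    temp = []
    for index_j, j in enumerate(B):
        # compare the indices, not the image tensors themselves
        if index_j > index_i:  # or index_j >= index_i, if you don't want to compare the same indices
            continue
        temp.append(model(i, j))
    out.append(temp)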
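
Since the index check above only computes one triangle of the matrix, the skipped entries still have to be filled in afterwards. Here is a minimal sketch of that mirroring step in plain Python. The names pairwise_symmetric and toy_compare are hypothetical, toy_compare is only a stand-in for the real model, and the sketch assumes that the model gives the same result for (i, j) and (j, i).

```python
def pairwise_symmetric(items, compare):
    """Fill an n x n matrix by calling compare only for j <= i and mirroring."""
    n = len(items)
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # j <= i: lower triangle including the diagonal
            value = compare(items[i], items[j])
            out[i][j] = value
            out[j][i] = value  # mirror; assumes compare(a, b) == compare(b, a)
    return out


# Hypothetical stand-in for the model: absolute difference of two numbers.
calls = []

def toy_compare(a, b):
    calls.append((a, b))
    return abs(a - b)


matrix = pairwise_symmetric([1, 4, 6], toy_compare)
```

For 3 items this calls the stand-in 6 times instead of 9. The same idea should carry over to a preallocated tensor, where each result is written to both out[i, j] and out[j, i].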

If you pass my model the batch ([0,1,2],[0,1,2]), then it will output a scalar for each of the cases 0-0, 1-1 and 2-2, i.e. my output will be 3 scalars.

As I have mentioned in my question, I am using a batch size of 8. For each element i, I replicate it 8 times and pass it together with 8 elements of the inner batch, j. But this is too slow, and I want to optimise this part.

Yes, I could do that, but is there a way to combine this with the previous optimisation (if found) to further speed up the task?

I’m not sure if there is another way to optimize this.
Assuming you are dealing with N images, your model would have to calculate
N + (N-1) + (N-2) + ... + 1 = N * (N+1) / 2 comparisons.
You could speed it up using batches, but if I understand your explanation correctly, you are avoiding unnecessary calculations.
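
To combine both ideas, one option might be to build the batches directly from the unique index pairs, so that every batch is full and no pair is computed twice. The following is a rough sketch using only the standard library. The helper unique_pair_batches is a hypothetical name, and n = 10 is just a small value for checking the count against the formula above.

```python
from itertools import combinations_with_replacement, islice


def unique_pair_batches(n, batch_size):
    """Yield (rows, cols) index lists covering every pair i <= j exactly once."""
    pairs = combinations_with_replacement(range(n), 2)
    while True:
        chunk = list(islice(pairs, batch_size))
        if not chunk:
            return
        rows, cols = zip(*chunk)
        yield list(rows), list(cols)


n, batch_size = 10, 8
batches = list(unique_pair_batches(n, batch_size))

# Total number of pairs should match N * (N + 1) / 2 from the formula above.
total_pairs = sum(len(rows) for rows, _ in batches)
```

With tensors, each batch could then probably be evaluated as model(A[rows], A[cols]), with the results written to out[rows, cols] and out[cols, rows]. This assumes the model accepts batched inputs and is symmetric in its two arguments. If this is inference only, wrapping the loop in torch.no_grad() and using a batch size larger than 8 (as far as memory allows) will likely help more than the index bookkeeping itself.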