Optimization for self outer product

Hi, everyone. This is my first post here.

As the title says, I am currently implementing the outer product of a vector with itself. PyTorch's outer product function (torch.outer) seems to only take a single vector as input, not a batch of vectors. After some searching, I used torch.einsum to calculate the outer product for a batch of vectors.
However, since the outer product is taken of a vector with itself, the result is symmetric, so it is only necessary to calculate a triangle instead of the full matrix. I think this would cut the computational and memory cost roughly in half.

import numpy as np
import torch


elapsed_times = []

# Left over from a training setup; the scaler is not actually used below.
scaler = torch.cuda.amp.GradScaler()

# CUDA events for timing the GPU work between record() calls.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# A batch of 256 vectors, each of dimension 1024.
batch_vector = torch.rand((256, 1024)).cuda()

for i in range(10000):
    # print(f'\r{i}', end='')
    start.record()
    with torch.cuda.amp.autocast():
        # Batched outer product: (256, 1024) -> (256, 1024, 1024)
        outter = torch.einsum('bi,bj->bij', batch_vector, batch_vector)
    end.record()
    torch.cuda.synchronize()

    elapsed_time = start.elapsed_time(end)

    elapsed_times.append(elapsed_time)

print()
print(np.sum(elapsed_times))
print(np.mean(elapsed_times))

The code above is my current implementation of the outer product, without the triangular optimization.
The code finishes in 17822 seconds, with an average of 1.8 seconds per step.

Could someone help me implement an efficient version of the outer product?

The reason it’s taking so long is that you’re repeating it 10^4 times. Could you explain more about the 10^4 iterations part? If you’re processing 10^4 different vectors, you could try doing multiple vectors per batch to be faster (subject to memory).

If these vectors are sequentially dependent on one another, I think this might be as fast as it can go.

The loop represents an epoch, where each step calculates the outer product for a batch of 256 vectors, each of dimension 1024.

My question is about an efficient implementation of the outer product itself. Because the memory cost of a full outer product is O(N^2) per vector, it limits how many vectors I can process per batch. I would like an efficient implementation so that I can calculate more per batch.
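To make what I mean concrete, something along these lines is what I have in mind (just a rough sketch using torch.triu_indices, not benchmarked): compute only the upper-triangular entries directly, so each vector produces N(N+1)/2 values instead of N^2.

import torch

batch_vector = torch.rand((256, 1024)).cuda()
n = batch_vector.shape[1]

# Index pairs (i, j) with i <= j: the upper triangle including the diagonal.
rows, cols = torch.triu_indices(n, n, device=batch_vector.device)

# For each vector v in the batch, compute v[i] * v[j] only for i <= j.
# Result shape: (batch, n * (n + 1) // 2) instead of (batch, n, n).
tri = batch_vector[:, rows] * batch_vector[:, cols]

Whether this is actually faster than the full einsum probably depends on the GPU and the sizes (the column gathers are not free), so it would need benchmarking, but the stored result is roughly half the size.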

So I just ran the loop without AMP and it takes about 30 s on my GPU. Are you sure you’re measuring it correctly? torch.cuda.Event records the time in milliseconds, not seconds. See here in the documentation.
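For reference, something like this would report the numbers in the right units (using the variable names from your snippet):

total_ms = np.sum(elapsed_times)   # start.elapsed_time(end) returns milliseconds
print(f'total: {total_ms / 1000:.3f} s')
print(f'mean per step: {np.mean(elapsed_times):.3f} ms')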

You can also perform this via matmul, although the speed is the same (at least for me). A matmul equivalent would be:

outter = batch_vector.unsqueeze(2) @ batch_vector.unsqueeze(1)

Hi, sorry for the late reply.
I missed a dot in the reported number. It took 17.822 seconds to finish 10000 iterations.

Your suggested operation computes the same full outer product, while I am trying to find a more efficient implementation of the outer product.
Please note that both inputs to the outer product are the same vector, so the result is symmetric. This means that almost half of the calculation is wasted.
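Storing only the triangle loses nothing, since the full symmetric matrix can be rebuilt from it whenever it is needed. A rough sketch (full_from_triangle is just a name I made up for illustration, taking the tri tensor from the earlier sketch):

import torch

def full_from_triangle(tri, n):
    # tri has shape (batch, n * (n + 1) // 2), holding the upper-triangle values.
    rows, cols = torch.triu_indices(n, n, device=tri.device)
    batch = tri.shape[0]
    full = tri.new_zeros((batch, n, n))
    full[:, rows, cols] = tri              # fill the upper triangle
    full = full + full.transpose(1, 2)     # mirror into the lower triangle
    diag = torch.arange(n, device=tri.device)
    full[:, diag, diag] /= 2               # the diagonal was counted twice
    return full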

This is (at least for me) the fastest way to perform the outer product. What are you going to use the outer product for? Depending on the circumstances, you might be able to circumvent the outer product entirely (but that depends on what your use case is).
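For example (purely an illustration, since I don’t know your downstream use): if the outer products are only ever contracted with some other tensor, the contraction can often be reassociated so the N x N matrices are never materialized.

import torch

x = torch.rand(256, 1024).cuda()   # batch of vectors
w = torch.rand(1024).cuda()        # hypothetical downstream vector

# Naive: form the batched outer product, then contract with w.
naive = torch.einsum('bi,bj->bij', x, x) @ w          # (256, 1024)

# Reassociated: (x x^T) w == x * (x . w), no (1024, 1024) matrices needed.
fast = x * (x @ w).unsqueeze(1)                       # (256, 1024)

print(torch.allclose(naive, fast, atol=1e-4))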