GPU speed and memory difference between einsum and matmul

Hi Jason!

I can reproduce your observation (on a smaller gpu).

Here is my tweaked version of your test script:

import torch
import time

print (torch.__version__)
print (torch.version.cuda)
print (torch.cuda.get_device_name())
print (torch.cuda.get_device_properties ('cuda').total_memory)

_ = torch.manual_seed (2022)

bs = 8
L = 1024   # reduce size to fit in smaller gpu
dim = 64

tensor1 = torch.randn((bs, L, dim)).to('cuda')
tensor2 = torch.randn((L, L, dim)).to('cuda')

# warmup the GPU -- use actual tensors and operations
for _ in range(5):
    warmup_tensor = torch.matmul(tensor1, tensor1.transpose(1, 2))
    warmup_tensor = None
    warmup_tensor = torch.einsum("bld,lrd->blr", tensor1, tensor2)
    warmup_tensor = None
    warmup_tensor = torch.matmul(tensor2, tensor1.unsqueeze(-1)).squeeze(-1)
    warmup_tensor = None

torch.cuda.reset_peak_memory_stats ('cuda')
torch.cuda.synchronize()
start = time.time()
output1 = torch.einsum("bld,lrd->blr", tensor1, tensor2)
torch.cuda.synchronize()
end = time.time()
print('einsum time:', end-start)
print('einsum memory (GB):', torch.cuda.max_memory_allocated('cuda')/10**9)

output1 = None

torch.cuda.reset_peak_memory_stats ('cuda')
torch.cuda.synchronize()
start = time.time()
output2 = torch.matmul(tensor2, tensor1.unsqueeze(-1)).squeeze(-1)
torch.cuda.synchronize()
end = time.time()
print('matmul time:', end-start)
print('matmul memory (GB):', torch.cuda.max_memory_allocated('cuda')/10**9)

output1 = torch.einsum("bld,lrd->blr", tensor1, tensor2)   # recompute einsum result for allclose() check

print('same res?', torch.allclose(output1, output2, atol=1e-5)) # we are using float not double

And here is its output:

1.12.0
11.6
GeForce GTX 1050 Ti
4236312576
einsum time: 0.008707761764526367
einsum memory (GB): 0.337641472
matmul time: 0.07655215263366699
matmul memory (GB): 2.48512512
same res? True

I don’t know if it’s “normal,” but this kind of thing has been seen before.
See, for example:

It might be worth noting that because you are adding a trailing singleton
dimension (unsqueeze (-1)) to tensor1, you are, in effect, performing
a batch of vector dot products rather than a batch of fully general matrix
products.
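To make that concrete, here is a little separate sketch (with toy shapes of my own
choosing, not the ones from the timing test) that checks one element of the matmul()
result against an explicit torch.dot():

import torch

bs, L, dim = 2, 4, 3   # toy sizes just for illustration
t1 = torch.randn (bs, L, dim)       # mirrors tensor1's shape
t2 = torch.randn (L, L, dim)        # mirrors tensor2's shape

# same pattern as in the script above: (L, L, dim) matmul (bs, L, dim, 1)
res = torch.matmul (t2, t1.unsqueeze (-1)).squeeze (-1)   # shape (bs, L, L)

b, l, r = 1, 2, 3
# element (b, l, r) is just the dot product of two length-dim vectors
print (torch.allclose (res[b, l, r], torch.dot (t1[b, l], t2[l, r])))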

Computing a batch of dot products is not a rare use case, but pytorch
does not offer a specialized batch-dot-product function. I’ve come to the
conclusion that einsum() is a perfectly satisfactory way to compute a
batch dot product (and it’s what I use by default when the need arises).
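For what it’s worth, here is a minimal sketch of what I mean by a batch dot
product (the names and shapes are just for illustration):

import torch

x = torch.randn (100, 64)   # batch of 100 vectors of length 64
y = torch.randn (100, 64)

dots_einsum = torch.einsum ('bd,bd->b', x, y)       # batch dot product via einsum
dots_mul = (x * y).sum (dim = -1)                    # elementwise multiply-and-sum equivalent
dots_bmm = torch.bmm (x.unsqueeze (1), y.unsqueeze (2)).flatten()   # matmul-style equivalent

print (torch.allclose (dots_einsum, dots_mul, atol = 1e-5))
print (torch.allclose (dots_einsum, dots_bmm, atol = 1e-5))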

(It’s worth noting that there are instances where einsum() – perhaps with
older versions of pytorch – unreasonably underperforms the equivalent
matmul() computation (with various transpose()s and unsqueeze()s
to get the dimensions to line up correctly).)

Idle speculation:

Perhaps matmul()'s performance tuning has been focused on full matrix
products, rather than the “edge” case of batch dot products. This would
hardly excuse matmul()'s underperformance, but might offer a historical
explanation.

Or it might be some glitch in matmul()'s broadcasting support. It might
be interesting to perform the comparison when creating tensor1 with
an explicit trailing singleton dimension, rather than using unsqueeze().
(You could also try adding a leading singleton dimension to tensor2.
You would, of course, still be broadcasting bs over tensor2’s singleton
dimension, and I don’t think it would be a fair comparison to avoid such
broadcasting.)
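For concreteness, here is roughly what I have in mind (an untested sketch that
continues from the script above and reuses its bs, L, and dim):

# create the trailing singleton dimension explicitly instead of using unsqueeze()
tensor1b = torch.randn((bs, L, dim, 1)).to('cuda')   # explicit trailing singleton
tensor2b = torch.randn((1, L, L, dim)).to('cuda')    # optional leading singleton for bs

torch.cuda.reset_peak_memory_stats('cuda')
torch.cuda.synchronize()
start = time.time()
output3 = torch.matmul(tensor2b, tensor1b).squeeze(-1)   # still broadcasts bs over tensor2b's leading 1
torch.cuda.synchronize()
end = time.time()
print('matmul (explicit singletons) time:', end - start)
print('matmul (explicit singletons) memory (GB):', torch.cuda.max_memory_allocated('cuda') / 10**9)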

Best.

K. Frank