Efficient tensordot

I’m not sure this is the proper forum for questions of this type; let me know if it isn’t.

  1. I am looking for the PyTorch way of doing the following:
    Given
    a = torch.Tensor([[1, 2, 3], [4, 5, 6]]) # 2 x 3
    b = torch.Tensor([
    [[ 1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    [[12, 11, 10], [9, 8, 7], [6, 5, 4], [ 3, 2, 1]]
    ]) # 2 x 4 x 3
    I would like to get
    r = [[14.0, 32.0, 50.0, 68.0], [163.0, 118.0, 73.0, 28.0]] # 2 x 4
    with the inner result vectors being dot products of a, b sub-vectors as follows:
    r = [
    [a[0] \dot b[0][0], a[0] \dot b[0][1], a[0] \dot b[0][2], a[0] \dot b[0][3]],
    [a[1] \dot b[1][0], a[1] \dot b[1][1], a[1] \dot b[1][2], a[1] \dot b[1][3]]
    ]
    In other words, with the first dimension of both a and b being the batch dimension, for each batch element I need to take the vector from a as a 1 x 3 row and multiply it by the transposed 4 x 3 matrix from b (i.e. a 3 x 4 array), ending up with two four-element vectors arranged along the batch dimension.

I tried
r = torch.tensordot(a, b, dims=([1], [2]))
but this produces
[[[14.0, 32.0, 50.0, 68.0], [64.0, 46.0, 28.0, 10.0]], [[32.0, 77.0, 122.0, 167.0], [163.0, 118.0, 73.0, 28.0]]]
i.e. more than I need - I need only the diagonal of it

  2. Is there a good guide to read about PyTorch matrix operations in a systematic way?

  3. Extra: if there is more than one way of doing 1., which would be the most efficient in terms of CUDA operations, if the first dimension is large (hundreds) and the two others are relatively small (a dozen or two)? Where can I read more about PyTorch / CUDA efficiency?

Hi @Tomasz_Dryjanski ,

Have a look at torch.bmm, which computes a matrix multiplication for each corresponding pair of matrices in the batch.
Here is an example of what you want to do:

import torch

# torch.Tensor(...) already yields float32, so no extra .float() is needed
a = torch.Tensor([[1, 2, 3], [4, 5, 6]])  # B, 3
b = torch.Tensor([[[ 1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                  [[12, 11, 10], [9, 8, 7], [6, 5, 4], [ 3, 2, 1]]])  # B, 4, 3
a = a.unsqueeze(1)  # B, 1, 3

print(a.shape, b.shape)
# torch.Size([2, 1, 3]) torch.Size([2, 4, 3])

# (B, 1, 3) @ (B, 3, 4) -> (B, 1, 4)
c = torch.bmm(a, b.transpose(1, 2))
c = c.squeeze(1)  # B, 4

print(c)
# tensor([[ 14.,  32.,  50.,  68.],
#         [163., 118.,  73.,  28.]])

albanD will hopefully answer your extra question :^)

Edit: an example using einsum, which albanD mentioned:

a = torch.Tensor([[1, 2, 3], [4, 5, 6]]).float()
b = torch.Tensor([[[ 1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                  [[12, 11, 10], [9, 8, 7], [6, 5, 4], [ 3, 2, 1]]]).float()

torch.einsum("bn, bmn -> bm", a, b)
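
As a quick sanity check, here is a minimal sketch that compares both approaches against the diagonal of the tensordot result from the question. The variable names (via_bmm, via_einsum, via_diagonal) are only illustrative, and the tensors are the ones from the thread.

```python
import torch

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]])  # 2 x 3
b = torch.tensor([[[1., 2., 3.], [4., 5., 6.], [7., 8., 9.], [10., 11., 12.]],
                  [[12., 11., 10.], [9., 8., 7.], [6., 5., 4.], [3., 2., 1.]]])  # 2 x 4 x 3

via_bmm = torch.bmm(a.unsqueeze(1), b.transpose(1, 2)).squeeze(1)  # 2 x 4
via_einsum = torch.einsum("bn,bmn->bm", a, b)                      # 2 x 4

# tensordot pairs every a[i] with every b[j]: full[i, j, k] = a[i] . b[j][k]
full = torch.tensordot(a, b, dims=([1], [2]))  # 2 x 2 x 4
idx = torch.arange(a.shape[0])
via_diagonal = full[idx, idx]                  # keep only i == j -> 2 x 4

print(torch.allclose(via_bmm, via_einsum), torch.allclose(via_bmm, via_diagonal))
```

The tensordot route computes B x B x 4 values and throws most of them away, so it is only useful here as a cross-check, not as something to run on a large batch.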

@Caruso's answer above is good for 1.

For 2: I’m afraid there is no such guide. But there are functions for all the matrix ops you might want, so hopefully a guide is not necessary. In general, I use torch.matmul(), which performs generic batched matrix multiplication, and add extra dimensions where needed.
Note that it is sometimes more efficient to do the product and the reduction by hand: for example, an element-wise product followed by sum(dim=[-1, -2]) if you need to reduce two dimensions at once.
Finally, if you are interested, we also have an einsum() function that lets you perform arbitrary reductions specified in Einstein notation.
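
To make the two options above concrete, here is a rough sketch of the torch.matmul() variant and the "by hand" variant for the problem in this thread. The sizes B, M, N are assumptions picked to match the description in the question (batch in the hundreds, other dimensions a dozen or two), and random data stands in for real inputs.

```python
import torch

torch.manual_seed(0)
B, M, N = 256, 16, 12      # illustrative sizes, not taken from real data
a = torch.randn(B, N)      # B x N
b = torch.randn(B, M, N)   # B x M x N

# torch.matmul with an extra trailing dimension:
# (B, M, N) @ (B, N, 1) -> (B, M, 1) -> (B, M)
r_matmul = torch.matmul(b, a.unsqueeze(-1)).squeeze(-1)

# Product and reduction by hand: broadcast a to (B, 1, N),
# multiply element-wise, then sum over the last dimension
r_manual = (a.unsqueeze(1) * b).sum(dim=-1)

agree = torch.allclose(r_matmul, r_manual, atol=1e-4)
print(r_matmul.shape, agree)
```

Both give a B x M result; which one is faster depends on the sizes and the device, so it is worth measuring rather than guessing.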

For 3: PyTorch / CUDA efficiency rules are the same as regular CUDA efficiency rules (that you can find online). The gist is: do only large ops :smiley:


Thank you so much! :slight_smile:

About efficiency: is there e.g. any difference between
torch.bmm(a.unsqueeze(1), b.transpose(2, 1)).squeeze(1)
and
torch.bmm(b, a.unsqueeze(2)).squeeze(2)
in terms of operation speed, assuming that b.size()[0] >> b.size()[1] and b.size()[0] >> b.size()[2]
(i.e. the batch size being much bigger than any other dimension)?

In theory, no. Ops like transpose or squeeze don’t actually touch the content of the Tensor, so they don’t need to run anything on the GPU: they only change the Tensor metadata stored in RAM.
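
A small sketch of what "only change the metadata" means, assuming an arbitrary 256 x 16 x 12 tensor: the transposed and unsqueezed tensors point at the same memory as the original, and only shape and strides differ.

```python
import torch

b = torch.randn(256, 16, 12)
bt = b.transpose(1, 2)   # shape 256 x 12 x 16, no data copied
bu = b.unsqueeze(1)      # shape 256 x 1 x 16 x 12, no data copied

# All three tensors start at the same memory address
same_memory = bt.data_ptr() == b.data_ptr() == bu.data_ptr()

print(same_memory)               # True
print(b.stride(), bt.stride())   # (192, 12, 1) (192, 1, 12)
print(bt.is_contiguous())        # False: the view is no longer laid out row-major
```

The non-contiguous layout is one reason two mathematically equivalent calls can still end up on different code paths.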

In practice, GPU optimization is quite hard and there is no silver bullet. You can see wildly different behavior for similar ops just because you transpose or unsqueeze a different dimension, so you will have to run the code to see which one is faster.
Don’t forget to call torch.cuda.synchronize() before starting and before stopping the timer: CUDA kernels are launched asynchronously, so without it you would only measure the launch time, not the actual GPU work.
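
A minimal timing sketch for the two variants from the question could look like the following. The helper time_op, the tensor sizes, and the warm-up and repeat counts are all assumptions for illustration; the script falls back to the CPU when no GPU is available, in which case no synchronization is needed.

```python
import time
import torch

def time_op(fn, device, repeats=100, warmup=10):
    # Hypothetical helper: average wall-clock time of fn() in seconds
    for _ in range(warmup):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()   # make sure warm-up work has finished
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.randn(512, 12, device=device)       # batch >> other dimensions
b = torch.randn(512, 16, 12, device=device)

def variant_1():
    return torch.bmm(a.unsqueeze(1), b.transpose(2, 1)).squeeze(1)

def variant_2():
    return torch.bmm(b, a.unsqueeze(2)).squeeze(2)

same_result = torch.allclose(variant_1(), variant_2(), atol=1e-4)
t1 = time_op(variant_1, device)
t2 = time_op(variant_2, device)
print(f"same result: {same_result}, variant 1: {t1 * 1e6:.1f} us, variant 2: {t2 * 1e6:.1f} us")
```

The numbers will differ between GPUs, PyTorch versions, and tensor sizes, so treat any single run as a data point rather than a verdict.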
