Dot product batch-wise

avijit_dasgupta · November 9, 2017, 8:26pm

I have two matrices of dimension (6, 256). I would like to calculate the dot product row-wise so that the dimensions of the resulting matrix would be (6 x 1). torch.dot does not support batch-wise calculation. Any efficient way to do this?

richard · November 9, 2017, 8:38pm

Each row is a vector with 256 elements; what do you mean by dot product? Do you just want to multiply all of those elements together?

SimonW · November 9, 2017, 8:38pm

torch.bmm(A.view(6, 1, 256), B.view(6, 256, 1)) should do the trick!
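For anyone following along, here is a minimal sketch of what that call produces, assuming two random (6, 256) tensors named A and B as in the question. The per-row `torch.dot` loop is only there as a reference check and is not part of the suggestion.

```python
import torch

A = torch.rand(6, 256)
B = torch.rand(6, 256)

# (6, 1, 256) @ (6, 256, 1) -> (6, 1, 1): one 1x1 matrix product per row
out = torch.bmm(A.view(6, 1, 256), B.view(6, 256, 1))

# Drop the extra singleton dimension to get the (6, 1) shape asked for
row_dots = out.view(6, 1)

# Reference: explicit per-row torch.dot, stacked into the same shape
reference = torch.stack([torch.dot(A[i], B[i]) for i in range(6)]).view(6, 1)

print(row_dots.shape)
print(torch.allclose(row_dots, reference, rtol=1e-4))
```

Note that `bmm` returns a (6, 1, 1) tensor, so a final `view` or `squeeze` is needed to reach (6, 1).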

http://pytorch.org/docs/0.2.0/torch.html#torch.bmm

avijit_dasgupta · November 10, 2017, 4:45am

Yeah! That would do. There is no direct function then, right?

SimonW · November 10, 2017, 7:30am

I don’t think so. Well, this is very efficient because there is no copying to massage the data.

ecolss · June 9, 2018, 1:45am

OK, let’s say, on macOS, CPU:

In [166]: a = th.Tensor(np.random.rand(100000, 10))

In [167]: %timeit th.sum(a*a, dim=1).sum()
2.8 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [168]: %timeit th.bmm(a.view(-1,1,10), a.view(-1,10,1)).sum()
64.2 ms ± 666 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

bmm is roughly 23 times slower here, and it also makes the backward pass much slower. Why?
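The same comparison can be run outside IPython. This is a rough sketch using the standard `timeit` module; the helper names `by_sum` and `by_bmm` are made up for illustration, and the timings will differ by machine and PyTorch version, so treat any ratio as indicative only.

```python
import timeit
import torch

a = torch.rand(100000, 10)

def by_sum():
    # Elementwise product, then reduce over the feature dimension
    return torch.sum(a * a, dim=1).sum()

def by_bmm():
    # Batched (1 x 10) @ (10 x 1) products, one per row
    return torch.bmm(a.view(-1, 1, 10), a.view(-1, 10, 1)).sum()

runs = 20
t_sum = timeit.timeit(by_sum, number=runs) / runs
t_bmm = timeit.timeit(by_bmm, number=runs) / runs
print(f"sum: {t_sum * 1e3:.2f} ms per call, bmm: {t_bmm * 1e3:.2f} ms per call")
```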

Stone · June 9, 2018, 9:58am

In addition, the results are similar on GPU.
For the GPU version, bmm invokes a separate CUDA kernel for each matrix multiplication, which in this case means 10000 kernel launches. The function-call overhead then dominates the total computation time.

SimonW · June 9, 2018, 7:32pm

Yeah, it might not be specifically optimized for this case. Thanks for doing the benchmark.

SimonW · June 9, 2018, 7:32pm

I just checked: bmm is a single batched gemm kernel call. It’s not doing 10000 kernel launches.

Stone · June 10, 2018, 2:11am

Yeah you’re right, it uses a single batched kernel. Actually the CPU version is a loop in batch dimension.

https://github.com/pytorch/pytorch/blob/29849e428cba03f4a9e24a157781ce822512c39e/aten/src/TH/generic/THTensorMath.cpp#L2308


  THTensor_(resizeAs)(result, t);
  if (beta != 0.0) {
    THTensor_(copy)(result, t);
  }
}


THTensor *matrix1 = THTensor_(new)();
THTensor *matrix2 = THTensor_(new)();
THTensor *result_matrix = THTensor_(new)();


for (batch = 0; batch < THTensor_(size)(batch1, 0); ++batch) {
  THTensor_(select)(matrix1, batch1, 0, batch);
  THTensor_(select)(matrix2, batch2, 0, batch);
  THTensor_(select)(result_matrix, result, 0, batch);


  THTensor_(addmm)(result_matrix, beta, result_matrix, alpha, matrix1, matrix2);
}


THTensor_(free)(matrix1);
THTensor_(free)(matrix2);
THTensor_(free)(result_matrix);

bananacode · June 26, 2019, 9:54pm

If anyone came across this post via a google search, I suggest they check out the following github issue:

Given two batches of vectors A and B, it is fastest to just compute (A*B).sum(-1).
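A short sketch of that approach, assuming A and B are both (6, 256) as in the original question. The `keepdim=True` variant gives the (6, 1) shape directly, and `torch.einsum` is shown only as an equivalent spelling, not as something the linked issue recommends.

```python
import torch

A = torch.rand(6, 256)
B = torch.rand(6, 256)

# Elementwise product, then sum over the last dimension: shape (6,)
dots = (A * B).sum(-1)

# Same thing, keeping the reduced dimension: shape (6, 1)
dots_col = (A * B).sum(-1, keepdim=True)

# Equivalent einsum spelling: multiply matching (i, j) entries, sum over j
dots_einsum = torch.einsum("ij,ij->i", A, B)

print(dots.shape, dots_col.shape)
print(torch.allclose(dots, dots_einsum, rtol=1e-4))
```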

tengerye · December 22, 2019, 8:04am

I'm adding another method that uses matmul() with a transpose. The methods are ordered from fastest to slowest:

a = torch.rand(2, 4)

%timeit (a*a).sum(1)
# 4.26 µs ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit torch.matmul(a, a.t()).diag()
# 6.81 µs ± 365 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit torch.bmm(a.view(2, 1, 4), a.view(2, 4, 1)).view(2, 1)
# 16.2 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
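One caveat about the matmul variant, sketched below under the assumption of a larger batch (N = 1000, D = 16, values chosen only for illustration): `torch.matmul(a, a.t())` builds the full N x N matrix of all pairwise dot products before `diag()` picks out the N that are wanted, so its cost grows with N squared. The 2 x 4 timing above is too small to show this.

```python
import torch

N, D = 1000, 16
a = torch.rand(N, D)

# Full Gram matrix: every row dotted with every other row, shape (N, N)
full = torch.matmul(a, a.t())

# Only the diagonal holds the row-with-itself dot products
via_diag = full.diag()

# Direct row-wise computation never materializes the (N, N) intermediate
via_sum = (a * a).sum(1)

print(full.shape, via_diag.shape)
print(torch.allclose(via_diag, via_sum, rtol=1e-4))
```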

jshtok · June 24, 2020, 5:53pm

In general,
torch.bmm(A.unsqueeze(dim=1), B.unsqueeze(dim=2)).squeeze()
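One thing to watch with a bare `.squeeze()`: it removes every size-1 dimension, so a batch of one row collapses to a 0-dim tensor. The sketch below squeezes only the two trailing dimensions instead; `batch_dot` is a hypothetical helper name used for illustration, not a PyTorch function.

```python
import torch

def batch_dot(A, B):
    """Row-wise dot product of two (N, D) tensors, returning shape (N,)."""
    # (N, 1, D) @ (N, D, 1) -> (N, 1, 1); squeeze only the two trailing dims
    return torch.bmm(A.unsqueeze(1), B.unsqueeze(2)).squeeze(-1).squeeze(-1)

# A batch containing a single row
A1 = torch.rand(1, 4)
B1 = torch.rand(1, 4)

bare = torch.bmm(A1.unsqueeze(1), B1.unsqueeze(2)).squeeze()  # batch dim is lost
safe = batch_dot(A1, B1)                                      # batch dim is kept

print(bare.shape, safe.shape)
```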