I’m not sure this is the proper forum for questions of this type; please let me know if it isn’t.
I am looking for the PyTorch way of doing the following:
Given
a = torch.Tensor([[1, 2, 3], [4, 5, 6]]) # 2 x 3
b = torch.Tensor([
[[ 1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
[[12, 11, 10], [9, 8, 7], [6, 5, 4], [ 3, 2, 1]]
]) # 2 x 4 x 3
I would like to get
r = [[14.0, 32.0, 50.0, 68.0], [163.0, 118.0, 73.0, 28.0]] # 2 x 4
with the inner result vectors being dot products of a, b subvectors as follows:
r = [
[a[0] \dot b[0][0], a[0] \dot b[0][1], a[0] \dot b[0][2], a[0] \dot b[0][3]],
[a[1] \dot b[1][0], a[1] \dot b[1][1], a[1] \dot b[1][2], a[1] \dot b[1][3]]
]
In yet other words: with the first dimension of both a and b being the batch dimension, I need to take, batch-wise, each vector a[i] as a 1 x 3 row, do a standard matrix multiplication with the transposed (3 x 4) slice of b, and end up with two batch-arranged four-element vectors.
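To make the target concrete, here is one formulation I believe matches the description above: a batched matrix–vector product, expressed with torch.bmm and equivalently with torch.einsum (a sketch, not claiming it is the most efficient way):

```python
import torch

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]])            # 2 x 3
b = torch.tensor([
    [[1., 2., 3.], [4., 5., 6.], [7., 8., 9.], [10., 11., 12.]],
    [[12., 11., 10.], [9., 8., 7.], [6., 5., 4.], [3., 2., 1.]],
])                                                        # 2 x 4 x 3

# Batched matrix-vector product: r[i] = b[i] @ a[i], shape 2 x 4
r_bmm = torch.bmm(b, a.unsqueeze(2)).squeeze(2)

# The same contraction written as an einsum over the shared batch index
r_einsum = torch.einsum('bkm,bm->bk', b, a)

print(r_bmm)  # tensor([[ 14.,  32.,  50.,  68.], [163., 118.,  73.,  28.]])
```

Both versions contract a’s last dimension against b’s last dimension while keeping the batch dimension aligned, which is exactly the "diagonal" of the tensordot result.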
I tried
r = torch.tensordot(a, b, dims=([1], [2]))
but this produces
[[[14.0, 32.0, 50.0, 68.0], [64.0, 46.0, 28.0, 10.0]], [[32.0, 77.0, 122.0, 167.0], [163.0, 118.0, 73.0, 28.0]]]
i.e. more than I need: the full 2 x 2 x 4 tensor of all cross-batch products, of which I need only the diagonal.
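For completeness, the diagonal can be pulled out of the tensordot result with torch.diagonal, though this still materializes the full 2 x 2 x 4 tensor first, which seems wasteful when the batch dimension is large:

```python
import torch

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]])            # 2 x 3
b = torch.tensor([
    [[1., 2., 3.], [4., 5., 6.], [7., 8., 9.], [10., 11., 12.]],
    [[12., 11., 10.], [9., 8., 7.], [6., 5., 4.], [3., 2., 1.]],
])                                                        # 2 x 4 x 3

full = torch.tensordot(a, b, dims=([1], [2]))             # 2 x 2 x 4

# Take the diagonal over the two batch axes; torch.diagonal moves the
# diagonal to the last dimension, so transpose back to get 2 x 4.
r = torch.diagonal(full, dim1=0, dim2=1).transpose(0, 1)

print(r)  # tensor([[ 14.,  32.,  50.,  68.], [163., 118.,  73.,  28.]])
```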

Is there a good guide to read about PyTorch matrix operations in a systematic way?

Extra: if there is more than one way of doing the above, which would be the most efficient in terms of CUDA operations if the first dimension is large (hundreds) and the other two are relatively small (a dozen or two)? Where can I read more about PyTorch / CUDA efficiency?
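As a starting point for comparing candidates myself, I sketched a timing harness with torch.utils.benchmark; the concrete sizes below are just illustrative assumptions matching the "hundreds by a dozen or two" regime (move the tensors to CUDA and the Timer will synchronize appropriately):

```python
import torch
from torch.utils import benchmark

# Illustrative sizes: large batch, small inner dimensions (assumed values)
B, K, M = 512, 16, 16
a = torch.randn(B, M)
b = torch.randn(B, K, M)

def via_bmm(a, b):
    # Batched matrix-vector product via a single bmm kernel
    return torch.bmm(b, a.unsqueeze(2)).squeeze(2)

def via_mul_sum(a, b):
    # Broadcast-multiply then reduce; no matmul kernel involved
    return (b * a.unsqueeze(1)).sum(dim=2)

t_bmm = benchmark.Timer(stmt='via_bmm(a, b)',
                        globals={'via_bmm': via_bmm, 'a': a, 'b': b})
t_mul = benchmark.Timer(stmt='via_mul_sum(a, b)',
                        globals={'via_mul_sum': via_mul_sum, 'a': a, 'b': b})
print(t_bmm.timeit(100))
print(t_mul.timeit(100))
```

The two functions should agree numerically; which one wins on a GPU depends on kernel launch overhead versus memory traffic at these shapes, which is why measuring beats guessing.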