That might depend on the tensor sizes and the device.
I’m afraid einsum might not always be optimal, as it is much more general than other ops.
In this case, you can use broadcasting to get the same result:
In [1]: import torch
In [2]: N, M, D = 10, 20, 30
In [3]: t1 = torch.rand(N,D)
In [4]: t2 = torch.rand(M,D)
In [5]: %timeit t3=torch.einsum('nd,md->nmd',t1,t2)
41.2 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [6]: %timeit t3=torch.einsum('nd,md->nmd',t1,t2)
41.3 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit t3=t1.unsqueeze(1) * t2.unsqueeze(0)
18.4 µs ± 130 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit t3=t1.unsqueeze(1) * t2.unsqueeze(0)
23.1 µs ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: (t1.unsqueeze(1) * t2.unsqueeze(0) - torch.einsum('nd,md->nmd',t1,t2)).abs().max()
Out[9]: tensor(0.)
Note that the timings may vary wildly for different sizes or if you use a GPU.
Also, for other ops, einsum might be faster.
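To make the shape mechanics explicit, here is a minimal sketch (with arbitrary small sizes) of how the two unsqueeze calls line the tensors up for broadcasting, and a check that the result matches einsum:

```python
import torch

N, M, D = 4, 5, 6
t1 = torch.rand(N, D)
t2 = torch.rand(M, D)

# unsqueeze inserts a size-1 dim, so the elementwise multiply broadcasts:
# (N, 1, D) * (1, M, D) -> (N, M, D)
out = t1.unsqueeze(1) * t2.unsqueeze(0)
ref = torch.einsum('nd,md->nmd', t1, t2)

assert out.shape == (N, M, D)
assert torch.allclose(out, ref)
```

The broadcasted multiply avoids einsum’s general-purpose dispatch, which is where the speedup in the timings above comes from.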