Dot product batch-wise

Yeah, you’re right, it uses a single batched kernel. The CPU version, though, is just a loop over the batch dimension.
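To make the difference concrete, here is a rough sketch of the two strategies. It is not the library's actual code: it assumes NumPy arrays of shape (batch, dim), and the function names `batch_dot_loop` and `batch_dot_vectorized` are made up for illustration. The first loops over the batch dimension, the second does the whole batch in one vectorized call.

```python
import numpy as np

def batch_dot_loop(a, b):
    # Hypothetical helper: one dot product per batch element,
    # computed by looping over the batch dimension.
    return np.array([np.dot(a[i], b[i]) for i in range(a.shape[0])])

def batch_dot_vectorized(a, b):
    # Hypothetical helper: a single call that multiplies elementwise
    # and sums over the feature axis for every batch element at once.
    return np.einsum("bd,bd->b", a, b)

# Both strategies should agree on random input of shape (batch=4, dim=3).
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3))
b = rng.standard_normal((4, 3))
assert np.allclose(batch_dot_loop(a, b), batch_dot_vectorized(a, b))
```

Both return one scalar per batch element, so the result has shape (batch,). The loop version is the easier one to read; the vectorized one is usually the better choice when the batch is large.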
