torch.matmul(torch.randn(16, 4, 7056, 10), torch.randn(16, 4, 10, 7056))
RuntimeError: CUDA out of memory. Tried to allocate 11.87 GiB (GPU 0; 10.91 GiB total capacity; 99.79 MiB already allocated; 9.00 GiB free; 4.21 MiB cached)
The output size is
torch.Size([16, 4, 7056, 7056])
which is 16 × 4 × 7056 × 7056 ≈ 3.19 billion float32 elements, i.e. about 11.87 GiB — the allocator's request is correct. If you want the whole output on the GPU at once, you need a card with more than 12 GB of memory.
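You can check that number yourself with a little arithmetic (assuming the default float32 dtype, 4 bytes per element):

```python
# Bytes needed for a (16, 4, 7056, 7056) float32 tensor.
elements = 16 * 4 * 7056 * 7056
bytes_needed = elements * 4          # float32 = 4 bytes per element
gib = bytes_needed / 2**30           # GiB, the unit CUDA reports
print(f"{gib:.2f} GiB")              # → 11.87 GiB, matching the error message
```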
You don’t give much context — maybe this is obvious, but in case not:
You can keep the matrices on the CPU, iterate over the first two (batch) dimensions, and send the sliced tensors to the GPU for the actual
matmul (which only operates on the last two dimensions).
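A minimal sketch of that loop (the helper name `chunked_matmul` is my own; shapes are shrunk so it runs on any machine — with your sizes, only one 7056 × 7056 slice would live on the GPU at a time, ~190 MiB instead of ~12 GiB):

```python
import torch

def chunked_matmul(a, b, device="cuda"):
    # a: (B, H, M, K) on CPU, b: (B, H, K, N) on CPU.
    # Loop over the two batch dims, do each (M, K) @ (K, N) product
    # on `device`, and move the result back to the CPU immediately,
    # so at most one (M, N) slice occupies GPU memory at a time.
    B, H, M, K = a.shape
    N = b.shape[-1]
    out = torch.empty(B, H, M, N)
    for i in range(B):
        for j in range(H):
            out[i, j] = (a[i, j].to(device) @ b[i, j].to(device)).cpu()
    return out

# Smaller than the shapes in the question, just to demonstrate:
a = torch.randn(16, 4, 64, 10)
b = torch.randn(16, 4, 10, 64)
dev = "cuda" if torch.cuda.is_available() else "cpu"
res = chunked_matmul(a, b, device=dev)   # shape (16, 4, 64, 64)
```

If even one slice is too large, you can additionally chunk along the M dimension (rows of `a`) within each batch element.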