Huge memory use in batched multiplication and summation

I want to apply torch.bmm to a tensor of shape [12000, 1200, 1] and one of shape [12000, 1, 1920], giving a [12000, 1200, 1920] tensor, and then apply torch.sum(..., dim=0) to obtain a [1200, 1920] result.
But this process consumes an enormous amount of memory. How can I do it without a for loop?

That's equivalent to a single matrix multiplication: [1200, 12000] @ [12000, 1920] → [1200, 1920]. Reshape the first tensor to [1200, 12000] (squeeze and transpose) and the second to [12000, 1920] (squeeze), and the sum over the batch dimension happens inside the matmul itself, so the huge [12000, 1200, 1920] intermediate is never materialized.
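A minimal sketch verifying the equivalence; the tensor names and the smaller sizes here are illustrative (the original shapes were B=12000, M=1200, N=1920):

```python
import torch

# Illustrative small sizes so the memory-heavy version also fits
B, M, N = 50, 12, 19
a = torch.randn(B, M, 1)
b = torch.randn(B, 1, N)

# Memory-heavy way: materializes a full [B, M, N] intermediate
heavy = torch.bmm(a, b).sum(dim=0)        # shape [M, N]

# Single matmul: [M, B] @ [B, N] -> [M, N], no [B, M, N] intermediate
light = a.squeeze(-1).t() @ b.squeeze(1)  # shape [M, N]

print(torch.allclose(heavy, light, atol=1e-5))
```

The batch-sum and the matmul compute the same quantity because both reduce over the batch index: sum_b a[b, i] * b_tensor[b, j].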