Large batched matrix product ... Tried to allocate 256.00 GiB

Hi,

I’m trying to compute the following batched matrix product.

x = torch.rand((256, 32, 32, 1, 256)).cuda()
y = torch.rand((256, 1, 1, 256, 1024)).cuda()
z = x @ y

expecting to obtain:
z → torch.Size([256, 32, 32, 1, 1024])

however I get:
RuntimeError: CUDA out of memory. Tried to allocate 256.00 GiB (GPU 0; 10.92 GiB total capacity; 512.00 MiB already allocated; 9.78 GiB free; 514.00 MiB reserved in total by PyTorch)
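I believe the 256.00 GiB matches the broadcasted operand: `matmul` broadcasts `y` over the leading (256, 32, 32) batch dims, and a float32 tensor of that expanded shape is exactly 256 GiB. A quick sanity check of the arithmetic:

```python
# Broadcasting expands y's singleton dims to match x's batch dims,
# giving an operand of shape (256, 32, 32, 256, 1024).
# At 4 bytes per float32 element, that is the allocation in the error.
elements = 256 * 32 * 32 * 256 * 1024
gib = elements * 4 / 2**30
print(gib)  # 256.0
```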

Is there a way to perform this operation efficiently so that it fits in GPU RAM?
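One approach I’m considering (a sketch, not necessarily optimal): since `y` only actually varies along dim 0, the spatial dims of `x` can be folded into one and the whole thing done as a single `torch.bmm` per batch element, so no expanded copy of `y` is ever materialized. Shown here on scaled-down shapes so the result can be checked against the plain `@` on CPU:

```python
import torch

# Stand-ins for 256, 32, 32, 256, 1024 so the check runs cheaply on CPU.
B, H, W, K, D = 8, 4, 4, 16, 32
x = torch.rand(B, H, W, 1, K)
y = torch.rand(B, 1, 1, K, D)

# Fold the spatial grid: (B, H*W, K) @ (B, K, D) -> (B, H*W, D),
# then restore the original layout. bmm never broadcasts, so the
# giant (B, H, W, K, D) intermediate is avoided.
z = torch.bmm(x.reshape(B, H * W, K), y.reshape(B, K, D)).reshape(B, H, W, 1, D)

print(z.shape)                   # torch.Size([8, 4, 4, 1, 32])
print(torch.allclose(z, x @ y))  # True: same result as the broadcasted matmul
```

With the full sizes this would be `x.reshape(256, 1024, 256)` against `y.reshape(256, 256, 1024)`, and the output `z` is about 1 GiB in float32, which should fit on the 11 GiB card.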