OOM for self-attention mechanism

We are running a self-attention mechanism in a conv1d network. We get an OOM error when we multiply the query tensor with the key tensor. The size of the query and key tensors is [16, 4, 22778]:
t_1 = torch.bmm(proj_query, proj_key)

We have tried converting our tensors to half-precision; however, we still get the same OOM error:

RuntimeError: CUDA out of memory. Tried to allocate 15.46 GiB (GPU 0; 14.76 GiB total capacity; 53.88 MiB already allocated; 13.97 GiB free; 80.00 MiB reserved in total by PyTorch)
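
For context, a quick back-of-the-envelope check (assuming the bmm output is the usual [batch, length, length] attention map, which is not confirmed by our code above) shows where the 15.46 GiB figure comes from:

# Shapes from the post: batch=16, channels=4, length=22778 (assumed layout).
B, C, N = 16, 4, 22778

# If proj_query is [B, N, C] and proj_key is [B, C, N], torch.bmm returns
# a [B, N, N] attention map; in float16 that single tensor alone needs:
attention_bytes = B * N * N * 2        # 2 bytes per float16 element
print(attention_bytes / 1024**3)       # ~15.46 GiB, matching the error message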

Do you have any recommendations to fix this issue?

The two tensors shouldn't have the same shape, as this would raise a shape mismatch error in the matrix multiplication.
Could you recheck the shapes and post them here, please?
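
For example, here is a minimal sketch of the usual conv1d self-attention layout (the module names and small sizes are only illustrative, not taken from your code) showing the shapes that make the bmm valid:

import torch
import torch.nn as nn

# Illustrative sizes only; in your case they would be B=16, C=4, N=22778.
B, C, N = 2, 4, 8
x = torch.randn(B, C, N)

query_conv = nn.Conv1d(C, C, kernel_size=1)   # hypothetical 1x1 conv projections
key_conv = nn.Conv1d(C, C, kernel_size=1)

proj_query = query_conv(x).permute(0, 2, 1)   # [B, N, C] - length moved to dim 1
proj_key = key_conv(x)                        # [B, C, N] - kept channel-first

# The inner dimensions (C and C) match, so bmm works and yields [B, N, N].
energy = torch.bmm(proj_query, proj_key)
print(energy.shape)                           # torch.Size([2, 8, 8])

# If proj_query and proj_key both had the identical shape [B, C, N],
# torch.bmm would raise a shape mismatch error rather than running OOM.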