Applying softmax to final logits causes CUDA out-of-memory error

I am implementing the Transformer from Google's paper "Attention is All You Need". The problem I'm running into is that when I apply a softmax to my final logits before outputting them, I get `RuntimeError: CUDA out of memory. Tried to allocate 1.38 GiB (GPU 0; 11.91 GiB total capacity; 8.91 GiB already allocated; 841.06 MiB free; 1.58 GiB cached)`. The error is raised when I call `loss.backward()` after computing the loss, but only when I have applied softmax (via `torch.nn.functional.softmax`) to the outputs. I track my tensor sizes by printing `.element_size() * .nelement()`, and with or without softmax the output tensors are around 1.5 GB, so I don't understand why the softmax makes the difference.
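Roughly, the flow looks like the simplified sketch below. The shapes, variable names, and the `cross_entropy` call are just placeholders to show the pattern, not my actual model or loss; the real sizes are much larger.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder shapes -- my real batch/sequence/vocabulary sizes are much larger.
batch_size, seq_len, vocab_size = 8, 64, 32000

# Stand-in for the decoder's final projection output (the real logits come
# from the full Transformer).
logits = torch.randn(batch_size, seq_len, vocab_size, device=device, requires_grad=True)
targets = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)

# This is the step that triggers the OOM for me: softmax over the vocab dimension.
probs = F.softmax(logits, dim=-1)

# Loss on the (already softmaxed) outputs; the OOM is raised inside backward().
loss = F.cross_entropy(probs.view(-1, vocab_size), targets.view(-1))
loss.backward()
```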

The full model code is quite long, so I haven't posted all of it here. Does anyone know what the problem could be, or have suggestions for what to try?