GPU running out of memory at only 50% consumption

I’m getting the following error (during the backward pass):
```
RuntimeError: CUDA out of memory. Tried to allocate 4.93 GiB (GPU 0; 23.65 GiB total capacity; 9.11 GiB already allocated; 10.44 GiB free; 12.34 GiB reserved in total by PyTorch) (malloc at /opt/conda/conda-bld/pytorch_1587428398394/work/c10/cuda/CUDACachingAllocator.cpp:289)
```

I have exclusive access to this GPU while the process is running, so nothing else is using its memory. It seems like PyTorch isn't reserving enough memory? Clearly there is enough free memory (10.44 GiB), unless fragmentation is so bad that I can't use roughly half of the GPU's memory. The failure always happens on the backward pass. I've even tried torch.utils.checkpoint (roughly as sketched below), but it didn't make much of a difference, since the forward pass takes far less memory even with gradient tracking enabled.
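For reference, this is a simplified version of how I apply checkpointing to a block (the layer type, dimensions, and shapes here are placeholders, not my actual model):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Placeholder block; my real model uses its own Transformer layers.
layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()

# (seq_len, batch, d_model); requires_grad so the checkpointed segment
# actually recomputes and backpropagates.
x = torch.randn(128, 4, 512, device="cuda", requires_grad=True)

# Activations inside `layer` are recomputed during backward instead of stored.
y = checkpoint(layer, x)
loss = y.sum()
loss.backward()
```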
My model uses Transformer layers as well as sparse-dense matrix multiplication, with variable-length sequences, so I do expect some fragmentation, but can fragmentation alone waste this much memory at the moment I call loss.backward()? How could I fix this?
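To see how badly the caching allocator is fragmented, I can dump its statistics right before the failing backward call, e.g. (device index 0 is just my setup):

```python
import torch

# Allocator statistics just before the failing loss.backward();
# a large gap between "reserved" and "allocated" memory points to
# fragmentation inside PyTorch's caching allocator rather than
# genuine exhaustion of the device.
print(torch.cuda.memory_summary(device=0))
print(f"allocated: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 2**30:.2f} GiB")
```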
Does backpropagation through a sparse-dense matrix multiplication return a sparse gradient w.r.t. the sparse tensor? If it returns a dense gradient instead, that could explain the high memory consumption during the backward pass.
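A minimal way to check this (toy shapes and values, not my actual tensors) would be something like:

```python
import torch

# Toy sparse (2x3) @ dense (3x4) product to inspect gradient layouts.
indices = torch.tensor([[0, 1, 1], [2, 0, 2]])
values = torch.tensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(indices, values, (2, 3), requires_grad=True)
d = torch.randn(3, 4, requires_grad=True)

out = torch.sparse.mm(s, d)
out.sum().backward()

print(s.grad.layout)  # torch.sparse_coo would mean the gradient stays sparse
print(d.grad.layout)  # torch.strided (dense) for the dense operand
```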