CUDA OOM when using torch.cumsum

I am implementing a model that uses x = torch.cumsum(x, dim=2) in one of its modules. As I understand it, this operation should be fast and, if done in place, should have low memory overhead. However, I get the OOM error below. When I try torch.cumsum(x, dim=2, out=x) instead, I get an error because x requires grad and the out= variant apparently does not support autograd. Is there any way to compute torch.cumsum() in place, or otherwise without blowing up the memory requirements?

File "/home/sww/wikipedia_training/mixer_model.py", line 119, in forward
    x = torch.cumsum(x, dim = 2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.50 GiB (GPU 1; 39.44 GiB total capacity; 23.79 GiB already allocated; 6.73 GiB free; 31.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
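For completeness, the direction I have been experimenting with is a custom autograd.Function that runs the cumsum in place during the forward pass and implements the backward by hand, using the fact that the gradient of a cumulative sum is a reversed cumulative sum of the upstream gradient. Below is a rough, untested sketch (the class name InplaceCumsum is just mine). Note that it only saves memory in the forward pass, the backward still allocates temporaries for the flips, and the in-place write only works when x is an intermediate tensor rather than a leaf that requires grad:

```python
import torch

class InplaceCumsum(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, dim):
        ctx.dim = dim
        ctx.mark_dirty(x)      # declare the in-place modification to autograd
        return x.cumsum_(dim)  # in-place cumulative sum, no extra output buffer

    @staticmethod
    def backward(ctx, grad_output):
        dim = ctx.dim
        # grad wrt x[i] is the sum of grad_output[j] over all j >= i,
        # i.e. a reversed cumsum (this step does allocate temporaries)
        grad = grad_output.flip([dim]).cumsum(dim).flip([dim])
        return grad, None      # the dim argument is not differentiable
```

Usage would then be x = InplaceCumsum.apply(x, 2) in place of the original line, but I am not sure whether this is the intended way to get around the out= restriction.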