Backprop Through Sparse Tensor Is Not Memory Efficient?


When I computed the gradient of a sparse tensor on the GPU, I ran into an out-of-memory error.

I need a tensor of size [1, 3, 224 * 224, 224 * 224] that has only 1 * 3 * 224 * 224 * 25 nonzero entries, so I store it in sparse COO format.

I take the sum over the last dim, which gives a tensor of size [1, 3, 224 * 224]:

b = torch.sparse.sum(a, dim=3).to_dense()

I reshape the tensor into size [1, 3, 224, 224]:

b = b.view(1, 3, 224, 224)

and then feed it into a neural network.
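The pipeline above can be sketched at a toy scale (8 * 8 standing in for 224 * 224, with random indices, so it runs anywhere):

```python
import torch

H = W = 8                      # stands in for 224
n = H * W
nnz = 1 * 3 * n * 5            # a few nonzeros per row, like the ~25 above

# Random COO indices and values for a [1, 3, n, n] sparse tensor.
idx = torch.stack([
    torch.zeros(nnz, dtype=torch.long),   # batch dim
    torch.randint(0, 3, (nnz,)),          # channel dim
    torch.randint(0, n, (nnz,)),          # row
    torch.randint(0, n, (nnz,)),          # column
])
vals = torch.rand(nnz)
a = torch.sparse_coo_tensor(idx, vals, (1, 3, n, n))

b = torch.sparse.sum(a, dim=3).to_dense()  # [1, 3, n]
b = b.view(1, 3, H, W)                     # [1, 3, H, W], ready for a network
print(b.shape)  # torch.Size([1, 3, 8, 8])
```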

The memory required to execute this computation should be very small. The forward pass is fine. However, I encounter an error during the backward pass:

RuntimeError: CUDA out of memory. Tried to allocate 28.14 GiB

It seems like PyTorch creates a dense tensor of size [1, 3, 224 * 224, 224 * 224] during the backward pass, which consumes exactly that amount of memory.
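The reported allocation does match a dense float32 tensor of the full shape, as a quick back-of-envelope check shows:

```python
# A dense float32 tensor of shape [1, 3, 224*224, 224*224] takes
# 3 * (224**2)**2 * 4 bytes.
n = 224 * 224
bytes_dense = 1 * 3 * n * n * 4
gib = bytes_dense / 2**30
print(f"{gib:.2f} GiB")  # -> 28.14 GiB, the exact size in the error message
```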

Am I misusing the sparse tensor functionality in a way that makes autodiff construct the full dense matrix explicitly during backprop?



I’m afraid the backward of to_dense() does not create a sparse Tensor, so it tries to allocate a full-size dense Tensor :confused:
Support for sparse Tensors is quite limited, so we usually don’t create them in the backward pass, to reduce the number of errors like “XXX op is not implemented for sparse Tensors”.


Thanks for the clarification!

Fortunately, in my case the backprop operation is quite simple, so I implemented a custom backward to solve the memory issue.
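For anyone hitting the same issue: one way to write such a custom backward (a hedged sketch, not necessarily the poster's actual code) is to fuse the sparse sum and to_dense() into a single autograd.Function. Since the gradient of a row-sum is just the incoming gradient broadcast to the nonzero positions, the backward only needs an [nnz]-sized gather and never materializes the dense [n, n] tensor:

```python
import torch

class SparseSumDense(torch.autograd.Function):
    """Fused sparse.sum(dim=3) + to_dense() with a sparse-aware backward."""

    @staticmethod
    def forward(ctx, values, indices, size):
        ctx.save_for_backward(indices)
        a = torch.sparse_coo_tensor(indices, values, size)
        return torch.sparse.sum(a, dim=3).to_dense()   # dense [B, C, n]

    @staticmethod
    def backward(ctx, grad_out):
        (indices,) = ctx.saved_tensors
        # d(row-sum)/d(value at [b, c, i, j]) = 1, so each nonzero's gradient
        # is just grad_out at its [b, c, i] position -- an [nnz]-sized gather.
        grad_values = grad_out[indices[0], indices[1], indices[2]]
        return grad_values, None, None

# Small usage example (toy sizes standing in for 224 * 224).
n, nnz = 16, 48
indices = torch.stack([
    torch.zeros(nnz, dtype=torch.long),
    torch.randint(0, 3, (nnz,)),
    torch.randint(0, n, (nnz,)),
    torch.randint(0, n, (nnz,)),
])
values = torch.rand(nnz, requires_grad=True)
b = SparseSumDense.apply(values, indices, (1, 3, n, n))  # dense [1, 3, n]
b.sum().backward()  # each nonzero value gets gradient 1
```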
