Hi,

When computing the gradient of a sparse tensor on the GPU, I ran into an out-of-memory error.

I need a tensor of size `[1, 3, 224 * 224, 224 * 224]` which only has 1 * 3 * 224 * 224 * 25 nonzero entries, so I store it in `sparse_coo` format.
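A minimal sketch of constructing such a tensor (the indices and values here are random placeholders just to illustrate the shape and sparsity; the real entries come from my actual computation):

```
import torch

n = 224 * 224          # flattened spatial size
nnz = 1 * 3 * n * 25   # number of nonzero entries

# Placeholder indices/values for illustration only
indices = torch.stack([
    torch.zeros(nnz, dtype=torch.long),   # batch index
    torch.randint(0, 3, (nnz,)),          # channel index
    torch.randint(0, n, (nnz,)),          # row index
    torch.randint(0, n, (nnz,)),          # column index
]).cuda()
values = torch.randn(nnz, device="cuda")

a = torch.sparse_coo_tensor(indices, values, size=(1, 3, n, n),
                            requires_grad=True)
```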

I take the sum over the last dim, which gives me a tensor of size `[1, 3, 224 * 224]`:

```
b = torch.sparse.sum(a, dim=3).to_dense()
```

I reshape the tensor to size `[1, 3, 224, 224]`:

```
b = b.view(1, 3, 224, 224)
```

and then feed it into a neural network.
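Schematically, with `model` and the loss standing in as placeholders for the actual network and objective:

```
out = model(b)      # forward pass completes without issue
loss = out.sum()    # placeholder loss for illustration
loss.backward()     # <- CUDA out of memory is raised here
```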

The memory required to execute this computation should be very small (the dense result `b` is only about 0.6 MB). The forward pass runs fine, but I encounter an error during the backward pass:

```
RuntimeError: CUDA out of memory. Tried to allocate 28.14 GiB
```

It seems like PyTorch creates a dense tensor of size `[1, 3, 224 * 224, 224 * 224]` during the backward pass, which would consume exactly that amount of memory: 1 * 3 * (224 * 224) * (224 * 224) float32 elements at 4 bytes each is about 28.14 GiB.

I am wondering whether I am misusing the sparse tensor functionality in a way that makes autograd construct the full dense matrix explicitly during backprop?