Why is there no "in place" version of torch.cumsum()?


I have a 1xCxHxW tensor I on GPU. For each channel, I want to compute the corresponding “integral image”. To do so, I use the following code (in PyTorch 1.0.0):

result = I.cumsum(dim=2).cumsum(dim=3)

This line of code is part of a function that compares patches within the same image and requires a lot of memory, since I is float32 with H and W around 3000 each and C around 100. For this reason, I try to use "in place" operations where possible. However, with the code above the GPU holds both the tensor behind I and the one behind result, even though I no longer need I once result has been generated. Unfortunately, it seems that no cumsum_() is available, and using del I does not decrease the memory usage reported by nvidia-smi in real time. The following workaround seems to work, but is it safe?

torch.cumsum(I, dim=2, out=I)
result = torch.cumsum(I, dim=3, out=I)

Thank you!

Yes, using the out= argument this way is legitimate.

As for del I not showing up in nvidia-smi: PyTorch's caching allocator keeps freed GPU memory reserved for reuse, so nvidia-smi stays flat even though the memory is available for new tensors; torch.cuda.memory_allocated() reflects the actual usage.

If you want a dedicated in-place method, you can submit a feature request on the GitHub page.
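For what it's worth, here is a small CPU sketch (toy sizes, standing in for your 1xCxHxW tensor) checking that the out= workaround matches the plain out-of-place version and reuses the same storage; the cumsum_() calls at the end assume a newer PyTorch release than 1.0.0, where Tensor.cumsum_ was added:

```python
import torch

# Toy stand-in for the 1xCxHxW image batch
I = torch.arange(24, dtype=torch.float32).reshape(1, 2, 3, 4)

# Out-of-place reference: allocates a second tensor
ref = I.cumsum(dim=2).cumsum(dim=3)

# The out= workaround: both passes write back into the same storage
work = I.clone()
torch.cumsum(work, dim=2, out=work)
result = torch.cumsum(work, dim=3, out=work)

assert torch.equal(result, ref)
# result aliases work's storage, so no extra allocation was made
assert result.data_ptr() == work.data_ptr()

# In newer PyTorch versions an in-place method exists as well
work2 = I.clone()
result2 = work2.cumsum_(dim=2).cumsum_(dim=3)
assert torch.equal(result2, ref)
```

Fully aliasing input and out is allowed here because cumsum scans each slice sequentially; only partial overlaps between input and out are rejected by PyTorch.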