Hello!
I have a `1xCxHxW` tensor `I` on the GPU. For each channel, I want to compute the corresponding "integral image". To do so, I use the following code (in PyTorch 1.0.0):
```python
result = I.cumsum(dim=2).cumsum(dim=3)
```
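(As a side note, a quick way to sanity-check that the double `cumsum` really produces the inclusive integral image per channel is a tiny CPU example; sizes here are made up for the test:)

```python
import torch

# Small random input: the double cumsum should give the inclusive
# integral image S[y, x] = I[:y+1, :x+1].sum() for each channel.
I = torch.rand(1, 2, 4, 5)
result = I.cumsum(dim=2).cumsum(dim=3)

# Compare one entry against the brute-force definition.
c, y, x = 1, 2, 3
expected = I[0, c, :y + 1, :x + 1].sum()
assert torch.allclose(result[0, c, y, x], expected)
```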
This line of code is part of a function that compares patches within the same image and requires a lot of memory: `I` is `float32` with `H` and `W` around 3000 each and `C` around 100, so each such tensor takes roughly 100 × 3000 × 3000 × 4 bytes ≈ 3.6 GB. For this reason, I try to use "in place" operations when possible. However, in the code above, the GPU hosts both the tensor associated with `I` and the one associated with `result`, while I am no longer interested in `I` once `result` has been generated. Unfortunately, it seems that no `cumsum_()` is available, and using `del I` does not decrease the memory usage in real time according to `nvidia-smi`. The following workaround seems to work, but is it safe?
```python
torch.cumsum(I, dim=2, out=I)
result = torch.cumsum(I, dim=3, out=I)
```
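For reference, here is a scaled-down sketch of how the memory behaviour of the two variants could be compared; note that `torch.cuda.memory_allocated()` tracks live tensors, whereas `nvidia-smi` reports the caching allocator's reserved pool (presumably why `del I` shows no effect there):

```python
import torch

def report(tag):
    # memory_allocated() counts live tensors only; the caching allocator
    # keeps freed blocks reserved, which is what nvidia-smi sees.
    print(tag, torch.cuda.memory_allocated() / 1024 ** 2, "MiB")

# Scaled-down sizes (the real case is C ~ 100, H ~ W ~ 3000, float32).
I = torch.rand(1, 8, 512, 512, device="cuda")
report("after allocating I:")

# Out-of-place version: I and result are alive at the same time.
result = I.cumsum(dim=2).cumsum(dim=3)
report("after out-of-place cumsum:")

del result
# Workaround writing back into I: result aliases I, so only one
# big tensor remains alive afterwards.
torch.cumsum(I, dim=2, out=I)
result = torch.cumsum(I, dim=3, out=I)
report("after cumsum with out=I:")
```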
Thank you!