Autograd independently on entries of a single tensor

Hello,

I am trying to do a computation that follows this pattern:

output = torch.zeros((n,), device=...)  # n = length of the sequence
for i in range(n):
    output[i] = do_computation_on(output[:i])  # placeholder for the real computation

To maintain compatibility with autograd (i.e., to avoid the "one of the variables needed for gradient computation has been modified by an inplace operation" error), it is necessary to use output[:i].clone().
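For concreteness, here is a minimal runnable sketch of the clone-based version (on CPU; n, w, and do_computation_on are just stand-ins for my actual setup):

import torch

n = 8
w = torch.ones(1, requires_grad=True)          # stand-in parameter

def do_computation_on(prefix):
    # placeholder for the real computation on output[:i]
    return (w * prefix).sum() + w.squeeze()

output = torch.zeros(n)
for i in range(n):
    # clone() decouples the prefix from later in-place writes into output,
    # which keeps autograd happy but copies the prefix on every iteration
    output[i] = do_computation_on(output[:i].clone())

output.sum().backward()                         # works, at quadratic copy cost
print(w.grad)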

In the interest of optimizing memory usage, I would like to avoid this call to clone(). It is clearly conceivable to implement the above pattern without any cloning: just maintain a list of singleton tensors and do the computation on that list instead of keeping the whole sequence in a single tensor. But that clearly sacrifices the memory contiguity (cache-friendliness) and slicing properties I'd like to keep.
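The cloning-free alternative I have in mind would be something like this sketch (same stand-in names), where the computation has to consume a Python list of 0-dim tensors rather than a contiguous slice:

import torch

n = 8
w = torch.ones(1, requires_grad=True)          # stand-in parameter

def do_computation_on(prefix_list):
    # placeholder: the real computation would have to work on a Python
    # list of 0-dim tensors rather than on a contiguous slice
    total = w.squeeze()
    for t in prefix_list:
        total = total + w.squeeze() * t
    return total

entries = []                                   # independent singleton tensors
for _ in range(n):
    entries.append(do_computation_on(entries))

output = torch.stack(entries)                  # assembled out-of-place at the end
output.sum().backward()                        # no clone() needed, but contiguity and slicing are lost during the loop
print(w.grad)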

Is there any way to maintain a list of independent variables (in the eyes of autograd) which are still contiguous in GPU memory? And if so, are there any drawbacks to such an approach?

Any help is greatly appreciated.

P.S.: I tried to do the above using .chunk() (since it returns a tuple of sub-tensors), and it seems PyTorch was one step ahead of me, as this does not work either:
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

Hi Danimator!

I don't believe that there is any way to do what you want.

At a higher level, bear in mind that a Tensor contains not only its values but also other data. In your particular case, you need the connections to the computation graph that autograd builds and uses.

I believe (but am not sure) that you could build a list of singleton tensors whose contained values were contiguous in memory (by manipulating the tensors' storage appropriately), but I don't see how you could control the placement of the tensors' other data, so you would lose the full version of the cache-friendliness that you probably want. Furthermore, PyTorch wouldn't know that the data values were contiguous in memory, so I don't see how you could keep any of the built-in slicing functionality.

(Also, just to be clear, autograd acts on whole tensors, not on elements of tensors. So performing the computation you describe in a way compatible with autograd would seem to require singleton tensors, even if they were assembled into larger aggregate tensors later.)
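As a small illustration of that last point, here is the kind of thing I mean by assembling singleton tensors into a larger aggregate out-of-place (torch.stack records the assembly in the graph, so gradients still flow back to the individual singletons):

import torch

# autograd tracks gradients per tensor, not per element, so each
# independently-tracked value needs to be its own (singleton) tensor
pieces = [torch.ones(1, requires_grad=True) for _ in range(4)]

whole = torch.stack(pieces)              # out-of-place aggregation, recorded in the graph
whole.pow(2).sum().backward()

print([p.grad.item() for p in pieces])   # each singleton gets its own gradient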

Best.

K. Frank


Hi K. Frank,

Thanks for your detailed response! That's a bit of a shame to find out, but not the end of the world. I suppose there is nothing more to do than to eat the quadratic cost of all these copies, and hopefully the cache-friendliness (and the speed of the copies) works out better than the alternative of maintaining a list of singleton tensors for all reasonable lengths I'd care about.

Best,
Danimator