Preallocate memory for function outputs

This question is about efficiency and speed. Say I need to run a function several times, each call returning a tensor, and I want to combine the results into one big tensor.
Here are examples of two ways to do this:

# assume that function() returns a (16, 16) tensor
# first way
a = torch.empty(10, 16, 16)
for i in range(10):
    a[i,:,:] = function(i)

# second way
a = []
for i in range(10):
    a.append(function(i))
a = torch.stack(a, 0)

… and there are more ways, e.g. using

First natural question: which way is generally more efficient or faster for memory and/or computation?
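For concreteness, here is one way the two approaches could be timed against each other. The `function` below is a trivial stand-in (it only needs to return a (16, 16) tensor), so the absolute numbers are not meaningful, only the comparison:

```python
import timeit

import torch

def function(i):
    # trivial stand-in for the real function; it just needs
    # to return a (16, 16) tensor
    return torch.full((16, 16), float(i))

def preallocated():
    # first way: assign each result into a slice of a preallocated tensor
    a = torch.empty(10, 16, 16)
    for i in range(10):
        a[i, :, :] = function(i)
    return a

def stacked():
    # second way: collect results in a list, then stack them
    return torch.stack([function(i) for i in range(10)], 0)

print("preallocated:", timeit.timeit(preallocated, number=1000))
print("stacked:     ", timeit.timeit(stacked, number=1000))
```

Both versions produce the same tensor, so only speed and memory differ.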

Second: in both examples above, as far as I understand, the function output is first stored as a separate tensor; then, during stacking or slice assignment, it is copied from one memory location to another, which seems inefficient. So, is there any way to make the function write its output directly into preallocated memory (in this case, a slice of the tensor a)?

UPDATE: OK, it turns out this is not really possible with custom Python functions. So the question becomes narrower: it now concerns only those PyTorch operations that always output a new tensor, not sharing memory with anything else.

Thanks in advance.

Most, if not all, basic PyTorch operations have an optional `out` argument; that is exactly the preallocated-memory mode you are asking about. Unfortunately, an error is thrown when a tensor argument requires gradients. One workaround is to wrap the operation in an `autograd.Function`, but that also requires manually writing `backward()`.
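A minimal sketch of both ideas, using `torch.add` as a stand-in for the asker's function (the `AddInto` wrapper is a made-up name for this example, not a library API):

```python
import torch

# --- out= into a preallocated buffer (no gradients involved) ---
a = torch.empty(10, 16, 16)
x = torch.randn(16, 16)
y = torch.randn(16, 16)
for i in range(10):
    # torch.add writes its result straight into the slice a[i],
    # so no intermediate tensor is allocated and then copied
    torch.add(x, y, out=a[i])

# --- hypothetical autograd.Function wrapper for the gradient case ---
class AddInto(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y, out):
        # `out` is a plain preallocated buffer (must not require grad)
        torch.add(x, y, out=out)
        ctx.mark_dirty(out)  # we modified an input in place
        return out

    @staticmethod
    def backward(ctx, grad_output):
        # d(x + y)/dx = d(x + y)/dy = identity; the buffer gets no gradient
        return grad_output, grad_output, None

xg = torch.randn(16, 16, requires_grad=True)
buf = torch.empty(16, 16)
res = AddInto.apply(xg, y, buf)
res.sum().backward()  # gradients flow to xg even though out= was used
```

As the answer notes, the `backward()` here has to be written by hand for each wrapped operation; for anything beyond a simple elementwise op that quickly becomes the hard part.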
