Some operations in PyTorch can be done "in place", some let you specify an explicit out= tensor, and some simply return a new Tensor/Variable you have to accept, creating a potentially large number of intermediate tensors that have to come and go every iteration.
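For concreteness, here is a minimal sketch of the three styles I mean, using torch.add purely as a placeholder op:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1000, 1000, device=device)
b = torch.randn(1000, 1000, device=device)

c = torch.add(a, b)        # returns a fresh tensor: a new allocation every call
a.add_(b)                  # in-place: result overwrites a's own storage
out = torch.empty_like(a)
torch.add(a, b, out=out)   # explicit output: result written into a preallocated buffer
```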
In my lab code I encounter all of the above, but the performance difference between them is not trivial to estimate or measure when you don't have working alternatives to compare against.
For example, if I have an operation that outputs 1 GB of data, should I just let Torch allocate a new 1 GB chunk of GPU RAM every time and 'del' the data from the previous iteration, or does that cause slowdowns? If a method with an explicit output is available, would that be better? Why do some methods offer these output choices and others don't?
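To make the comparison concrete, this is roughly what the two patterns look like in my loop (the size and the op are just placeholders; a 16384×16384 float32 tensor is about 1 GB):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
n = 16384                                   # 16384 * 16384 * 4 bytes ≈ 1 GB in float32
x = torch.randn(n, n, device=device)

# Pattern A: let Torch allocate a new result every iteration, del the old one
result = None
for _ in range(10):
    del result                              # drop the previous iteration's tensor
    result = torch.add(x, x)                # fresh ~1 GB allocation each time

# Pattern B: reuse one preallocated output buffer via out=
result = torch.empty_like(x)
for _ in range(10):
    torch.add(x, x, out=result)             # result written into the same buffer
```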
I've read some discussions here about this from a total-memory point of view, where you simply need to 'del' some Tensors all the time. Does this have a performance impact, for example by creating a GPU synchronization barrier?