Some operations in PyTorch can be done “in-place”, some let you specify an explicit out= tensor, and some simply return a new Tensor/Variable that you have to accept, creating a potentially large number of intermediate arrays that come and go every iteration.
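For readers following along, here is a minimal sketch of the three styles described above. The tensor names and sizes are made up for illustration; note that in PyTorch functions that support an explicit output, the keyword is spelled `out=`.

```python
import torch

# Hypothetical small tensors standing in for the large ones in the question.
a = torch.ones(4)
b = torch.full((4,), 2.0)

# 1) In-place: the trailing underscore means the result overwrites `a`.
a.add_(b)

# 2) Explicit output: the result is written into a preallocated buffer.
buf = torch.empty(4)
torch.add(a, b, out=buf)

# 3) Plain call: a new tensor is allocated and returned.
c = a + b
```

Only the third form allocates a fresh result tensor on every call; the first two write into storage that already exists.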
In my lab code I encounter all of the above, but the difference in impact is not trivial to estimate or measure if you don’t have alternatives to compare against.
For example, if I have an operation that outputs 1 GB of data, should I just let Torch allocate a new 1 GB GPU RAM chunk every time, and ‘del’ the previous data from the last iteration, or does that cause slowdowns? If a method with an explicit output is available, would that be better? Why do some methods have these output choices and some don’t?
I’ve read some discussions here about this from a total memory point of view, where you simply need to ‘del’ some Tensors all the time. Does this have a performance impact? For example, does it create a GPU barrier?
From playing around a bit, trying to do better than the custom GPU allocator that PyTorch uses is quite hard runtime-wise!
Basically, this 1 GB buffer will be reused in the next iteration when you need 1 GB again, so it will not use more memory and the allocation will be very fast. So the fact that torch outputs a new 1 GB tensor is not a problem at all; this buffer would be needed anyway for temporary results if you use autograd.
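One way to observe this reuse yourself is to watch how much memory the caching allocator holds around a `del` and a same-sized reallocation. This is only a sketch: the helper name and the 64 MB size are made up for illustration, and it returns `None` on a machine without CUDA.

```python
import torch

def reserved_after_realloc(n_bytes=64 * 1024 * 1024):
    """Return memory_reserved() readings (after alloc, after del, after realloc).

    Hypothetical helper for illustration; returns None if CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        return None
    x = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    after_alloc = torch.cuda.memory_reserved()
    del x  # the block goes back to PyTorch's cache, not to the driver
    after_del = torch.cuda.memory_reserved()
    y = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    after_realloc = torch.cuda.memory_reserved()
    return after_alloc, after_del, after_realloc

result = reserved_after_realloc()
print(result)
```

If the cached block is reused as described, the three readings should come out the same, since neither the `del` nor the second allocation has to go back to the CUDA driver.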
The only possible problem the allocator could create is a total memory usage higher than the memory needed for all your tensors (because it leaves some holes in memory). In that case, a workload that should use 11.8 GB of memory might not actually fit on a GPU with 12 GB of memory. But then I would say you will lose less time by just reducing your batch size (or splitting your batch in two and doing two forward/backward passes for a single parameter update) than by playing with freeing the allocator’s cache or deleting Tensors by hand.
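The “split the batch in two” suggestion can be sketched as follows. The model, data shapes, and learning rate are placeholders chosen for illustration, and the example runs on CPU; it checks that two half-batch backward passes accumulate the same gradient as one full-batch pass when the loss is a mean.

```python
import torch

torch.manual_seed(0)

# Placeholder model and data, purely for illustration.
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
x = torch.randn(16, 8)
y = torch.randn(16, 1)

# Reference: gradient from a single full-batch forward/backward pass.
opt.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Same update from two half batches: gradients accumulate across backward()
# calls, and dividing each loss by 2 keeps the overall mean unchanged.
opt.zero_grad()
for xb, yb in zip(x.chunk(2), y.chunk(2)):
    (loss_fn(model(xb), yb) / 2).backward()
split_grad = model.weight.grad.clone()

opt.step()  # one parameter update for the whole batch
```

Each half batch only needs the activations for 8 samples at a time, which is where the peak memory saving comes from.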
Thanks, good to hear someone else has thought about this as well. I agree: if all that differs is a bit of unnecessary GPU RAM use at the end, it’s only an issue if you’re really shoehorning your model. Currently I’m only using 1.5 GB of 11 GB, so I’m safe.
I was more concerned about whether there were undocumented side effects, for example CUDA barriers during explicit del’s (I got this notion because I thought I saw a speed-up when I removed my del’s from the code, but I need to check this again).
Replying to myself here - I did see a discussion on the older Torch issue trackers saying that their backend did exactly this (sync CUDA) at free, unless the caching memory allocator was enabled. I’m not sure how that translates to PyTorch, whether it uses its own memory allocator by default instead of CUDA’s and so gets around this problem.
PyTorch always uses the caching allocator, so the del’s should not have any big performance impact (but they are not really needed either, so…).
Emptying the cache of the memory allocator will cause a barrier though, so that should be used with caution.
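If you want to check this on your own setup, a rough timing sketch along these lines may help. The helper name, iteration count, and tensor size are made up for illustration; the numbers will vary by GPU and driver, and the function returns `None` without CUDA.

```python
import time
import torch

def time_loop(call_empty_cache, iters=20, n=16 * 1024 * 1024):
    """Time an allocate/fill/del loop, optionally emptying the cache each step.

    Hypothetical helper for illustration; returns None if CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        return None
    torch.cuda.synchronize()  # start from an idle GPU
    start = time.perf_counter()
    for _ in range(iters):
        t = torch.empty(n, device="cuda")
        t.fill_(1.0)
        del t  # returns the block to the caching allocator
        if call_empty_cache:
            torch.cuda.empty_cache()  # hands cached blocks back to the driver
    torch.cuda.synchronize()  # wait for queued work before stopping the clock
    return time.perf_counter() - start

with_del_only = time_loop(call_empty_cache=False)
with_empty_cache = time_loop(call_empty_cache=True)
print(with_del_only, with_empty_cache)
```

Comparing the two timings gives a rough idea of what calling `torch.cuda.empty_cache()` in the loop costs relative to plain del’s.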