Deleting tensors in a list, class, or tuple does not delete the original tensor

xvdp · May 20, 2019, 6:52pm

I'm rewriting my question because I have the answer.
Problem: if you declare CUDA tensors in Python, pass them to functions inside containers, and later need to delete them, you have to delete all references to them. See the example below,

or look at the gist:

https://gist.github.com/xvdp/5ea32d4baff801c8d80960a8fa7b4594

memory_tests.py

"""testing vram in pytorch cuda
Every time a variable is put inside a container in Python, removing it completely
requires deleting both the variable and the container.
This can be problematic when using PyTorch CUDA if one doesn't clear all containers.
Three tests:

>>> python memory_tests list
    # creates 2 tensors puts them in a list, modifies them in place, deletes them
    # in place mod changes original tensors
    # list and both tensors need to be deleted

This file has been truncated.

xvdp · May 22, 2019, 7:34pm

I guess the answer is that you just have to be careful.

If you declare

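import torch

t = torch.randn((1, 3, 1024, 1024), device="cuda")
f = [t]  # the list holds a second reference to the tensor
def something(f):
    # a list has no mul_, so scale each tensor in the list in place
    for x in f:
        x.mul_(0.001)
    return f  # returns the same list object, not a copy
f1 = something(f)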

then to clear the CUDA memory you have to

del t
del f
del f1

The same goes for tuples and dictionaries. I even tried building my own tensorset class in C++ with pybind; still, as soon as it is used in Python, passing it to a function and returning it creates a reference. To clear VRAM you have to delete all references.
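To make the reference counting visible without a GPU, here is a minimal sketch that uses a stand-in class instead of a real tensor. `FakeTensor` and `probe` are hypothetical names introduced only for this illustration, and the immediate release on the last `del` assumes CPython's reference counting.

```python
import weakref

class FakeTensor:
    """Hypothetical stand-in for a CUDA tensor; it holds no GPU memory."""
    def __init__(self, values):
        self.values = list(values)

    def mul_(self, k):
        # Mimics an in-place op: mutates this object and returns it.
        self.values = [v * k for v in self.values]
        return self

def something(items):
    for x in items:
        x.mul_(0.001)
    return items  # the same list object that was passed in

t = FakeTensor([1.0, 2.0])
probe = weakref.ref(t)  # a weak reference does not keep the object alive
f = [t]
f1 = something(f)

alive_after = []
del t
alive_after.append(probe() is not None)  # still alive: the list holds it
del f
alive_after.append(probe() is not None)  # still alive: f1 is that same list
del f1
alive_after.append(probe() is not None)  # gone: no references remain
print(alive_after)  # [True, True, False]
```

With real CUDA tensors the same rule applies, with one extra step: PyTorch's caching allocator keeps the freed block for reuse, so tools such as nvidia-smi may still show it as used until `torch.cuda.empty_cache()` is called.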

If anyone has a better solution, please shout.
Thanks

fei.hu · September 15, 2020, 5:24pm

I agree with the above comments. It is so easy to miss some references. It has been more than a year since the original post. I'm wondering if there are any other convenient ways to delete tensors.

xvdp · September 24, 2020, 4:06pm

The latest PyTorch has a much wider set of CUDA management tools, but cleanup, to my knowledge, is not part of it (I'm still on 1.5).

But one could go further than just cleaning up. A lot of things can be done much more cleanly in raw CUDA than in PyTorch CUDA. For instance, common projection ops like x.mm(x.T) cannot be done in place (and if you require grad you couldn't anyway), and they result in yet another tensor of the same size, when on many occasions you only need half of it, the upper triangle…

Or take some modification of the eye that fills in all the values, such as torch.eye(m).sub(1/m).
That tensor holds just 2 distinct values (1-1/m, -1/m), but if m happens to be large and you are using CUDA, it's a crapload of memory that you can be leaving around: the eye, the modified eye, and whatever you do with it. If you write it in a CUDA loop it could be minimal information.
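As an illustration of the two-value point: when the modified eye is only ever multiplied against another matrix, the m-by-m tensor never has to exist, because (I - J/m) x equals x minus its column means (J being the all-ones matrix). The sketch below checks that identity in plain Python on a tiny input. The function names are hypothetical, and this is not a CUDA kernel, only the arithmetic such a kernel could exploit.

```python
def center_columns(x):
    """Apply (I - J/m) to an m-by-n matrix without building the m-by-m matrix."""
    m, n = len(x), len(x[0])
    col_means = [sum(row[j] for row in x) / m for j in range(n)]
    return [[row[j] - col_means[j] for j in range(n)] for row in x]

def center_columns_dense(x):
    """Same result via the explicit matrix, the analogue of torch.eye(m).sub(1/m)."""
    m, n = len(x), len(x[0])
    c = [[(1.0 if i == j else 0.0) - 1.0 / m for j in range(m)] for i in range(m)]
    return [[sum(c[i][k] * x[k][j] for k in range(m)) for j in range(n)]
            for i in range(m)]

x = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
fast = center_columns(x)
dense = center_columns_dense(x)
print(fast)  # [[-2.0, -4.0], [0.0, 0.0], [2.0, 4.0]]
```

In PyTorch the matrix-free form would be `x - x.mean(dim=0, keepdim=True)`, which allocates one row of means instead of an m-by-m tensor.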

Another one that I'd like automated in memory management is one that I kind of pre-manage myself, but not fully. Let's say I'm testing batches of (256,3,256,256) on a bunch of networks. On a Titan RTX I can run those on ResNet18, 34 and 50, but not on ResNet101, DenseNets or VGG.
It'd be great to have a quick network eval that flags the maximum number of bytes required to run the batch or, with a tiny bit more math, the batch sizes that the network will accept.
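The "tiny bit more math" might look like the sketch below. It assumes float32 inputs and a linear memory model (peak = fixed cost + batch size × per-sample cost), which is only an approximation. The helper names and the example figures are hypothetical, not measurements of any real network.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor; 4 bytes per element assumes float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

def max_batch_size(budget_bytes, fixed_bytes, per_sample_bytes):
    """Largest batch that fits the budget under the assumed linear model:
    peak = fixed_bytes + batch * per_sample_bytes."""
    if per_sample_bytes <= 0:
        raise ValueError("per_sample_bytes must be positive")
    return max(0, (budget_bytes - fixed_bytes) // per_sample_bytes)

# The input batch alone, before any weights or activations.
input_bytes = tensor_bytes((256, 3, 256, 256))
print(input_bytes / 2**20)  # 192.0 (MiB)

# Hypothetical figures: 24 GiB budget, 1 GiB fixed cost, 100 MiB per sample.
example = max_batch_size(budget_bytes=24 * 2**30,
                         fixed_bytes=1 * 2**30,
                         per_sample_bytes=100 * 2**20)
print(example)  # 235
```

The input is only a lower bound, since activations usually dominate. The fixed and per-sample costs would have to be measured, for example by running two small batches and reading `torch.cuda.max_memory_allocated()` after each.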

These would be nice to have