I'm rewriting my question because I have the answer.
Problem: if you declare CUDA tensors in Python and pass them to functions inside containers, then when you need to free them you have to delete all references. See the example below:
t = torch.randn((1,3,1024,1024), device="cuda")
f = [t]
f1 = something(f)

To clear the CUDA memory you have to delete all of t, f and f1 (and then call torch.cuda.empty_cache()).
The same goes for tuples and dictionaries. I even tried building my own tensorset class in C++ + pybind; still, as soon as it is used in Python, passing it to a function and returning it creates a reference. To clear VRAM you have to delete all references.
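A minimal pure-Python sketch of why this happens (a plain object stands in for the CUDA tensor, and `something` is a hypothetical passthrough function; a weakref lets us observe exactly when the object dies, which is when the VRAM would actually be released):

```python
import weakref

class Dummy:                  # stands in for torch.randn(..., device="cuda")
    pass

def something(container):     # hypothetical function that hands its input back
    return container

t = Dummy()
ref = weakref.ref(t)          # observes when the object is really freed
f = [t]                       # the list holds a second reference to t
f1 = something(f)             # f1 is the same list: a third path to t

del t
assert ref() is not None      # still alive: f and f1 keep it reachable
del f
assert ref() is not None      # f1 references the same list
del f1
assert ref() is None          # all references gone; only now would VRAM free
```

With real tensors, CPython drops the CUDA allocation back into PyTorch's caching allocator at that last `del`; `torch.cuda.empty_cache()` then returns it to the driver.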
If anyone has a better solution, please shout.
The latest PyTorch has a much wider set of CUDA management tools, but cleanup, to my knowledge, is not part of it (I'm still on 1.5).
But one could go further than just cleaning up. There are a lot of things that can be done much more cleanly in raw CUDA than in PyTorch CUDA, for instance common projection ops like x.mm(x.T), which cannot be done in place (and if you require grad you couldn't anyway) but produces yet another tensor of the same size, when on many occasions you only need the upper triangle…
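To illustrate the redundancy (a numpy sketch so it runs on CPU; the shapes are the same as for x.mm(x.T) on CUDA): the Gram matrix is symmetric, so the packed upper triangle carries all the information in roughly half the memory.

```python
import numpy as np

x = np.random.randn(1000, 64).astype(np.float32)
g = x @ x.T                        # full 1000x1000 Gram matrix (~4 MB)
assert np.allclose(g, g.T, atol=1e-3)

# symmetric, so the packed upper triangle (incl. diagonal) is enough
iu = np.triu_indices(g.shape[0])
packed = g[iu]                     # 1000*1001/2 values (~2 MB)

# reconstruct the full matrix from the packed form when it is needed
full = np.zeros_like(g)
full[iu] = packed
full = full + full.T - np.diag(np.diag(full))  # un-double the diagonal
assert np.allclose(full, g)
```

A custom CUDA kernel could write the packed form directly and never allocate the full square.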
or some modification of the eye that fills all the values, such as torch.eye(m).sub(1/m). That tensor holds just two distinct values (1 - 1/m and -1/m), but if m happens to be large and you are using CUDA, it's a crapload of memory that you can be leaving around: the eye, the modified eye, and whatever you do with it. Written as a CUDA kernel it could be minimal information.
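torch.eye(m).sub(1/m) is the centering matrix: multiplying a data matrix by it just subtracts each column's mean, so the m×m matrix never needs to exist at all (numpy sketch, same arithmetic as the torch version):

```python
import numpy as np

m = 2000
x = np.random.randn(m, 16).astype(np.float32)

# explicit route: materializes an m x m matrix (~16 MB at m=2000, float32)
h = np.eye(m, dtype=np.float32) - 1.0 / m   # entries: 1-1/m and -1/m only
centered_big = h @ x

# same result with no m x m allocation at all
centered_small = x - x.mean(axis=0)

assert np.allclose(centered_big, centered_small, atol=1e-3)
```

The matmul costs O(m²) memory and O(m²·n) work; the mean-subtraction is O(m·n) for both, which is the kind of saving a fused CUDA loop gets you for free.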
Another thing I'd like automated in memory management, and which I kind of pre-manage myself but not fully: let's say I'm testing batches of (256,3,256,256) on a bunch of networks. On a Titan RTX I can run those on ResNet18, 34 and 50, but not on ResNet101, DenseNets or VGG.
It'd be great to have a quick network eval flagging the maximum number of bytes required to run the batch, or, with a tiny bit more math, the batch sizes it will accept.