The latest PyTorch has a much wider set of CUDA management tools, but cleanup, to my knowledge, is not part of it (I'm still on 1.5).
But one could go further than just cleaning up. There are many things that can be done much more cleanly in raw CUDA than through PyTorch's CUDA backend, for instance common projection ops like x.mm(x.T), which cannot be done in place (and if you require grad you couldn't anyway), but which allocates yet another full tensor when on many occasions you only need the upper triangle…
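To make the redundancy concrete, here is a small sketch (in NumPy for simplicity; the PyTorch equivalent would be x.mm(x.T) followed by torch.triu, which still materializes the full matrix). The shapes and data are made up for illustration:

```python
import numpy as np

# Hypothetical example: a Gram matrix x @ x.T is symmetric, so the
# strict lower triangle duplicates the strict upper triangle.
x = np.random.default_rng(0).standard_normal((4, 3))
g = x @ x.T                    # full (4, 4) Gram matrix gets allocated

assert np.allclose(g, g.T)     # symmetric: nearly half the entries are redundant
upper = g[np.triu_indices(4)]  # only 10 unique values out of the 16 stored
```

A custom CUDA kernel could compute and store just those 10 values; neither NumPy nor PyTorch exposes a triangular-output matmul out of the box.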
Or take some modification of the eye that fills all the values: that tensor holds just two distinct values, (1 - 1/m) on the diagonal and -1/m everywhere else. But if m happens to be large and you are using CUDA, that's a huge amount of memory you can be leaving around: the eye, the modified eye, and whatever you do with it. Written as a CUDA loop it could carry minimal information.
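If I'm reading the two-valued matrix right, it's the centering matrix H = I - J/m, and the op it implements never needs to be materialized at all. A sketch of the equivalence (NumPy for illustration; the same identity holds for torch tensors):

```python
import numpy as np

m, d = 5, 3
x = np.random.default_rng(1).standard_normal((m, d))

# Materialized version: an (m, m) matrix holding only two distinct values,
# (1 - 1/m) on the diagonal and -1/m off it.
H = np.eye(m) - np.full((m, m), 1.0 / m)
centered_dense = H @ x

# Equivalent result without ever allocating the m x m matrix.
centered_cheap = x - x.mean(axis=0)

assert np.allclose(centered_dense, centered_cheap)
```

For large m the dense route costs O(m^2) memory for what is really an O(m·d) operation, which is exactly the kind of waste a hand-written CUDA kernel avoids.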
Another one that I'd like automated in memory management, and which I kind of pre-manage myself but not fully: let's say I'm testing batches of (256, 3, 256, 256) on a bunch of networks. On a Titan RTX I can run those on ResNet-18, 34 and 50, but not ResNet-101, DenseNets, or VGG.
It'd be great to have a quick network eval that flags the maximum number of bytes it will require to run the batch, or, with a tiny bit more math, the batch sizes it will accept.
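As a rough sketch of the "tiny bit more math", once you have a measured per-sample footprint (in practice it would come from something like torch.cuda.max_memory_allocated() on a trial forward pass; the function and numbers below are hypothetical, and this counts only the input tensor, not activations or weights):

```python
# Back-of-the-envelope: bytes to hold one float32 batch of (256, 3, 256, 256).
batch, c, h, w = 256, 3, 256, 256
bytes_per_float32 = 4
input_bytes = batch * c * h * w * bytes_per_float32   # 201_326_592, i.e. 192 MiB
per_sample_bytes = c * h * w * bytes_per_float32      # 786_432 bytes per sample


def max_batch_size(free_bytes: int, per_sample_bytes: int) -> int:
    """Hypothetical helper: largest batch that fits a measured per-sample cost."""
    return free_bytes // per_sample_bytes


# e.g. 24 GiB of free memory, counting the input alone:
print(max_batch_size(24 * 2**30, per_sample_bytes))
```

The real per-sample cost is dominated by intermediate activations, so an honest version would trace the network once (forward hooks summing output sizes) rather than just counting the input.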
These would be nice to have