Automatic CPU => GPU => CPU memory management

Hello,
I have a rather ugly idea that would let me run inference with a fairly large pretrained model on my 8 GB of VRAM: convert CPU tensors to GPU tensors immediately before each tensor-related function executes, and bring the result back to the CPU afterwards. It is a bottleneck-creating solution, but it still seems theoretically faster than running everything on the CPU alone.

I tried to write a wrapper around the Python Tensor class, but that turned out to be a bad approach. Now I am looking for a point in the C++ code of torch where I can intercept all tensor-related functions and perform this device shuffling, without manually wrapping every possible function. For example, could the ATen/core/dispatch area have an intermediate function that is called before these math functions execute? In short, I am looking for a solution that does not require editing too much of the torch code, not least because I am not able to maintain such a huge project through its development. This is only an experiment, but I do not think I have many alternatives short of buying a more powerful video card. I hope there is an expert on the torch core source code who can advise me on how to navigate it and create a single, simple wrapper. A sketch of the behaviour I am after follows.
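For concreteness, here is roughly the behaviour I want, sketched with `torch.overrides.TorchFunctionMode`, a Python-level hook in recent PyTorch releases that intercepts most torch functions without touching the C++ dispatcher. The sketch is untested; note that in-place ops would mutate the temporary GPU copy rather than the original CPU tensor, and nested argument structures are not handled:

```python
import torch
from torch.overrides import TorchFunctionMode

def _to_gpu(x):
    # Move only plain CPU tensors; leave everything else alone.
    if isinstance(x, torch.Tensor) and x.device.type == "cpu":
        return x.cuda()
    return x

def _to_cpu(x):
    if isinstance(x, torch.Tensor) and x.is_cuda:
        return x.cpu()
    return x

class OffloadMode(TorchFunctionMode):
    """Intercept every torch function: push CPU tensor arguments to
    the GPU, run the op there, and pull the result back to the CPU."""

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # The mode is disabled inside this handler, so .cuda()/.cpu()
        # below do not re-enter it recursively.
        args = tuple(_to_gpu(a) for a in args)
        kwargs = {k: _to_gpu(v) for k, v in kwargs.items()}
        out = func(*args, **kwargs)
        if isinstance(out, torch.Tensor):
            return _to_cpu(out)
        if isinstance(out, (tuple, list)):
            return type(out)(_to_cpu(o) for o in out)
        return out

# Usage: every op inside the block round-trips through the GPU.
# with OffloadMode():
#     y = model(x)
```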

Thank you

P.S.
From a more serious point of view, I think this could be the starting point for a real feature: a "Memory Stream Management" class where the user can enable this behaviour, but also choose a policy for when to use the CPU or the GPU based on available memory, how often an operation recurs, and so on. I know it sounds dirty to create a "tensor reference object" that tracks the same tensor across different devices, but I think it could enable much stronger optimization of accelerator usage. Another example could be a middleware function that can be set to handle the cases where an operation involves tensors allocated on different devices, so the developer can choose which tensor to move where and continue the operation without breaking execution. Finally, the possibility of automatically moving a tensor to another device based on the currently available memory could by itself justify the creation of a TensorRef class, maybe compatible with the interface of the Tensor class... A rough sketch of such a memory-based policy follows.
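For example, the "available memory" part of such a policy could start as simple as this (`pick_device` is a made-up helper, and the 20% headroom is an arbitrary choice; `torch.cuda.mem_get_info()` reports free and total VRAM in bytes):

```python
import torch

def pick_device(nbytes: int) -> torch.device:
    """Hypothetical placement policy: run on the GPU only if the
    operands fit into the currently free VRAM, with some headroom."""
    if torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()
        if nbytes < 0.8 * free_bytes:  # keep 20% headroom (arbitrary)
            return torch.device("cuda")
    return torch.device("cpu")

# A TensorRef-style class could call pick_device() with the size of
# the tensors involved before deciding where to execute each op.
```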