Controlling host to device memory transfer in a custom layer

Is it possible (as in not too complicated) to not only run my custom C++ layer but also control when and how tensors are transferred to and from the GPU for that layer?
The only related thing I could find is this paper, where the authors were somehow able to make use of unified memory in PyTorch, but their code isn't available and it seems like they had to put in quite some effort to achieve it.

I'm not sure what exactly your use case is, but wouldn't it work to move the tensors around via the libtorch C++ API? What exactly is missing?
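For example, you can control the transfers explicitly with Tensor::to. A minimal sketch (the pinned-memory allocation and the non_blocking flag are just one way to decide when and how the copy happens, not something your layer has to use):

```cpp
#include <torch/torch.h>

int main() {
  // Allocate the source tensor in pinned (page-locked) host memory so the
  // later host-to-device copy can run asynchronously.
  auto host = torch::randn({1024, 1024},
                           torch::TensorOptions().pinned_memory(true));

  if (torch::cuda::is_available()) {
    // Explicit, user-controlled transfer to the GPU; non_blocking=true lets
    // the copy overlap with other work because the source is pinned.
    auto dev = host.to(torch::kCUDA, /*non_blocking=*/true);

    auto result = dev.matmul(dev);  // some GPU computation

    // Explicit transfer back to the host once the result is needed there.
    auto back = result.to(torch::kCPU);
  }
  return 0;
}
```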

Is it possible to move a tensor partially to the GPU, both for the forward and the backward pass?
My use case is a layer that is too big to fit into GPU memory, so I'm processing it in slices. It might be straightforward to do that manually for the forward pass, but I don't know how to do it for the backward pass.
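The only direction I can think of is wrapping the slicing in a custom autograd function so that the backward pass also moves one slice at a time. This is just a rough, untested sketch under made-up assumptions (a plain out = input.matmul(weight) layer; the SlicedLinear name and slice_size parameter are placeholders), not my actual layer:

```cpp
#include <torch/torch.h>
#include <algorithm>

// Streams a large CPU-resident weight through the GPU slice by slice,
// in both the forward and the backward pass.
struct SlicedLinear : public torch::autograd::Function<SlicedLinear> {
  static torch::Tensor forward(torch::autograd::AutogradContext* ctx,
                               torch::Tensor input,   // lives on the GPU
                               torch::Tensor weight,  // too big, lives on the CPU
                               int64_t slice_size) {
    ctx->save_for_backward({input, weight});
    ctx->saved_data["slice_size"] = slice_size;

    auto out = torch::zeros({input.size(0), weight.size(1)}, input.options());
    for (int64_t start = 0; start < weight.size(1); start += slice_size) {
      auto end = std::min(start + slice_size, weight.size(1));
      // Move only the current column slice of the weight to the device.
      auto w_slice = weight.slice(/*dim=*/1, start, end).to(input.device());
      out.slice(1, start, end) = input.matmul(w_slice);
    }
    return out;
  }

  static torch::autograd::tensor_list backward(
      torch::autograd::AutogradContext* ctx,
      torch::autograd::tensor_list grad_outputs) {
    auto saved = ctx->get_saved_variables();
    auto input = saved[0];
    auto weight = saved[1];
    auto slice_size = ctx->saved_data["slice_size"].toInt();
    auto grad_out = grad_outputs[0];

    auto grad_input = torch::zeros_like(input);    // on the GPU
    auto grad_weight = torch::zeros_like(weight);  // stays on the CPU
    for (int64_t start = 0; start < weight.size(1); start += slice_size) {
      auto end = std::min(start + slice_size, weight.size(1));
      auto w_slice = weight.slice(1, start, end).to(input.device());
      auto go_slice = grad_out.slice(1, start, end);
      grad_input += go_slice.matmul(w_slice.t());
      // Compute the weight gradient on the GPU, then move it back to the CPU.
      grad_weight.slice(1, start, end) =
          input.t().matmul(go_slice).to(weight.device());
    }
    // One gradient per forward argument; slice_size gets an empty tensor.
    return {grad_input, grad_weight, torch::Tensor()};
  }
};
```

It would then be called as `auto y = SlicedLinear::apply(x, w, /*slice_size=*/4096);`, so autograd would route the backward pass through the same slice-by-slice transfers. But I'm not sure this scales to the layer I actually have.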

I’ll assume that it’s a bit too complicated to do or at least too complicated to talk about here and mark this as resolved.