I am part of a research group trying to accelerate certain layers on an FPGA. For this, we need to define a new .to(device) and a new device, with a function that handles data movement between the accelerator and the CPU. I have created a C++ extension for this, but we are not sure how we could define a new .to(device) without any in-tree changes. Is there any resource or example on that? I also need a custom malloc function to allocate the tensors. Thank you very much in advance!