Disaggregated Out-of-Tree Backend

Hi, I'm looking to create a custom out-of-tree backend to support a disaggregated OS. Essentially, I want to replicate the CUDA kernels but run them remotely on the disaggregated devices, and I'm having some difficulty understanding a few components of the backend.

My current understanding of how a custom backend is put together is:

  • A device guard must be created to handle multiple devices in a thread-safe manner? (see the sketch after this list)
  • The allocator handles getting memory onto the device
  • Each kernel operates on memory that lives on the device
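To make the device-guard bullet concrete, here is a minimal sketch of what I had in mind, assuming the standard `PrivateUse1` route for out-of-tree backends. The `rpc_*` functions are hypothetical stand-ins for the disaggregated OS's RPC layer, not real APIs:

```cpp
#include <c10/core/Device.h>
#include <c10/core/Stream.h>
#include <c10/core/impl/DeviceGuardImplInterface.h>

// Hypothetical hooks into the disaggregated OS's RPC layer.
c10::DeviceIndex rpc_current_device();
void rpc_set_active_device(c10::DeviceIndex index);
c10::DeviceIndex rpc_device_count();

struct DisaggGuardImpl final : public c10::impl::DeviceGuardImplInterface {
  static constexpr c10::DeviceType static_type = c10::DeviceType::PrivateUse1;

  c10::DeviceType type() const override {
    return static_type;
  }
  c10::Device exchangeDevice(c10::Device d) const override {
    c10::Device old = getDevice();
    setDevice(d);
    return old;
  }
  c10::Device getDevice() const override {
    // Thread-local "current device" tracked by the RPC layer.
    return c10::Device(static_type, rpc_current_device());
  }
  void setDevice(c10::Device d) const override {
    // Selects which remote device subsequent RPCs target.
    rpc_set_active_device(d.index());
  }
  void uncheckedSetDevice(c10::Device d) const noexcept override {
    rpc_set_active_device(d.index());
  }
  c10::Stream getStream(c10::Device d) const noexcept override {
    // A single default stream per remote device, for now.
    return c10::Stream(c10::Stream::DEFAULT, d);
  }
  c10::Stream exchangeStream(c10::Stream s) const noexcept override {
    // No stream switching yet; each device only has its default stream.
    return getStream(s.device());
  }
  c10::DeviceIndex deviceCount() const noexcept override {
    return rpc_device_count();
  }
};

// Ties this implementation to the PrivateUse1 device type.
C10_REGISTER_GUARD_IMPL(PrivateUse1, DisaggGuardImpl);
```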

My current thought on how to implement this is to use the disaggregated OS's RPCs for memory handling in the Allocator, and then to load the CUDA kernels directly onto the device so that things should "just work". In the topology I'm working with, each device is associated with a single GPU, and each device exposes operations for loading CUDA kernels, creating CUDA streams, and allocating device or unified memory. What would be a good approach for extending PyTorch to support this OS as a backend?
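For the Allocator side, I was picturing something like the following: a `c10::Allocator` whose "pointers" are really opaque handles returned by the OS's memory RPCs. The `rpc_*` calls are again hypothetical, and I know the exact set of `c10::Allocator` virtuals (e.g. whether `allocate()` is `const`, or whether `copy_data()` must be overridden) has changed between PyTorch versions, so this is written against the older interface:

```cpp
#include <c10/core/Allocator.h>

// Hypothetical RPC layer (provided elsewhere). The returned "pointer" is an
// opaque remote handle that the remote kernels know how to dereference.
void* rpc_alloc_device_memory(size_t nbytes);
void rpc_free_device_memory(void* handle);
c10::DeviceIndex rpc_current_device();

struct DisaggAllocator final : c10::Allocator {
  c10::DataPtr allocate(size_t nbytes) const override {
    void* handle = nbytes == 0 ? nullptr : rpc_alloc_device_memory(nbytes);
    c10::Device device(c10::DeviceType::PrivateUse1, rpc_current_device());
    // data and ctx are both the remote handle; free_remote releases it.
    return {handle, handle, &free_remote, device};
  }

  static void free_remote(void* handle) {
    if (handle) {
      rpc_free_device_memory(handle);
    }
  }

  c10::DeleterFnPtr raw_deleter() const override {
    return &free_remote;
  }
};

static DisaggAllocator g_disagg_allocator;
REGISTER_ALLOCATOR(c10::DeviceType::PrivateUse1, &g_disagg_allocator);
```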

My main concern with this approach is that several of the CUDA kernels I'd be reusing make direct use of the CUDADeviceGuard. I assume swapping this out for my own device guard could be a large problem, given that my guard uses no CUDA device guards directly (since each device only has a single CUDA device on board). I was hoping to clarify my assumptions about the various components, and to find out whether it is feasible to directly copy the CUDA kernels and load them onto my remote devices?
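To make that concern concrete, here is roughly what I imagine a kernel registration would look like if the CUDA guard were replaced with the generic `c10::DeviceGuard`, which should dispatch to whatever guard impl is registered for `PrivateUse1`. The `add.Tensor` override and `rpc_launch_add_kernel` are just illustrative, and this assumes factory ops like `empty.memory_format` are also registered for the backend:

```cpp
#include <ATen/ATen.h>
#include <c10/core/DeviceGuard.h>
#include <torch/library.h>

// Hypothetical RPC that launches an already-loaded CUDA kernel remotely.
void rpc_launch_add_kernel(void* out, const void* a, const void* b, int64_t n);

at::Tensor disagg_add(
    const at::Tensor& self,
    const at::Tensor& other,
    const at::Scalar& alpha) {
  (void)alpha;  // alpha handling omitted to keep the sketch short.
  // Generic guard: this dispatches to DisaggGuardImpl, not CUDAGuard.
  c10::DeviceGuard guard(self.device());
  // Allocates through DisaggAllocator via the PrivateUse1 dispatch key.
  at::Tensor out = at::empty_like(self);
  rpc_launch_add_kernel(
      out.data_ptr(), self.data_ptr(), other.data_ptr(), self.numel());
  return out;
}

TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("add.Tensor", &disagg_add);
}
```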