Implement CUDA kernels for PrivateUse1 device

Can a custom device be created in such a way that it uses all operations from the CUDA device, except the ones we override? Or is there any way to create a new custom device on top of the CUDA device, which would perform all operations the same way as the default except the ones I override?

Would extending PyTorch work for your use case of overloading specific functions?
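
For example, here is a minimal sketch of one way to do that with `torch.overrides.TorchFunctionMode`. It intercepts calls at the Python API level, so it does not register a new device or a PrivateUse1 backend, and it may not be what you need if the override has to live in the dispatcher. The names `my_relu`, `OverrideSelected`, and `calls` are made up for illustration, and `torch.clamp` only stands in for wherever your own CUDA kernel would be called.

```python
import torch
from torch.overrides import TorchFunctionMode

calls = []  # illustration only: records which overrides were hit


def my_relu(x):
    # Hypothetical replacement. A real one might call a custom CUDA kernel
    # exposed through a C++/CUDA extension; torch.clamp is a stand-in.
    calls.append("relu")
    return torch.clamp(x, min=0)


class OverrideSelected(TorchFunctionMode):
    # Map of torch functions to replacements; everything else is untouched.
    OVERRIDES = {torch.relu: my_relu}

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        impl = self.OVERRIDES.get(func)
        if impl is not None:
            return impl(*args, **kwargs)
        # Fall through to the stock implementation (the CUDA one for CUDA tensors).
        return func(*args, **kwargs)


# Falls back to CPU so the sketch also runs on a machine without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.tensor([-1.0, 0.0, 2.0], device=device)

with OverrideSelected():
    y = torch.relu(x)  # routed to my_relu
    z = torch.abs(x)   # not in OVERRIDES, uses the default kernel
```

One caveat: the lookup is keyed on the exact function object, so `x.relu()` arrives as `torch.Tensor.relu` rather than `torch.relu` and would need its own entry in `OVERRIDES`.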