Question About Custom Gradient Functions and CUDA Acceleration

It appears that the class torch.autograd.Function supports custom differentiable operations by way of user-specified forward and backward Python functions. So if I create a custom derivative/gradient, how can that get executed on the GPU if it's implemented as Python code and not CUDA?
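For concreteness, here is a minimal sketch of the kind of custom Function I mean (the `Square` op and its hand-written gradient are just an example I made up):

```python
import torch

class Square(torch.autograd.Function):
    """Custom op: y = x**2 with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # stash x for use in backward
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output  # chain rule: dy/dx = 2x

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)
y.backward()
print(x.grad)  # gradient of x**2 at x=3 is 6
```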

Do tensors passed to custom autograd forward/backward functions have to traverse the PCI-E bus so the functions can execute on the host CPU?

Or more importantly, are all autograd functions delegated to the CPU?

The Python code of a user-specified forward/backward function does run on the host CPU, but it is only orchestration. If you use the PyTorch tensor API inside it, each operation dispatches to its CUDA implementation whenever the tensors it operates on live on a GPU, so the actual computation (and the data) stays on the device. Only if you pull values out of a tensor (e.g., `.item()`, `.cpu()`, NumPy conversion) does data cross the bus back to the host.
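A quick sketch of this device dispatch (the `f` helper is just for illustration; it stands in for the body of a custom forward or backward):

```python
import torch

def f(x):
    # Each of these tensor ops dispatches to a CUDA kernel
    # when x lives on a GPU, and to CPU kernels otherwise.
    # The Python code itself only launches the kernels.
    return x * x + x.sin()

# Pick the GPU if one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1000, device=device)
y = f(x)

# The result was computed on, and remains on, the input's device.
assert y.device == x.device
```

This is why the same Python autograd code works unchanged on CPU and GPU: the device of the input tensors, not the language the function is written in, determines where the math runs.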