It appears that the class torch.autograd.Function supports custom differentiable operations by way of user-specified forward and backward Python functions. So if I create a custom derivative/gradient, how can it execute on the GPU when it's implemented as Python code rather than CUDA?
Does the data for all custom autograd forward/backward functions have to traverse the PCI-E bus so the code can run on the host CPU?
Or, more importantly, are all autograd functions delegated to the CPU?
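For concreteness, here is a minimal sketch of the kind of custom Function I'm asking about (the op and names are just illustrative, not a real use case):

```python
import torch

class MySquare(torch.autograd.Function):
    """Toy custom op: y = x ** 2, with a hand-written gradient."""

    @staticmethod
    def forward(ctx, x):
        # Save the input so backward can use it.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        # dy/dx = 2x, applied via the chain rule.
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x
```

Both forward and backward here are ordinary Python code, so I'm unclear where they actually run when the input tensor lives on a CUDA device.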