Is there a way to call a PyTorch model inside a CUDA C kernel?

I have a use case like this: it is similar to a reinforcement learning rollout, where the neural network is in charge of producing part of the data, d1, and the rest of the data, s1, is then created by observing d1 (as a simple example, s1 = a_function(d1)), and so on. Right now I have a CUDA C kernel that takes care of generating s1 when d1 is given, meaning that a_function has already been implemented in CUDA C, something like this:

(d0, s0) ==> nn model ==> d1 ==> CUDA C creates s1 = a_function(d1) ==> (d1, s1) ==> nn model ==> ...

You can see that, to make this work, what I can do so far is call the CUDA C kernel every time I need to create s, so the code ends up switching back and forth between the nn model and the CUDA C kernel (roughly as in the sketch below). This does not seem efficient.
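For reference, this is roughly what the current loop looks like (a minimal sketch; `my_cuda_ext.a_function` is a placeholder for the compiled extension wrapping my CUDA C kernel, and `model` is the network):

```python
import torch

# Hypothetical extension module exposing the CUDA C kernel,
# e.g. built with torch.utils.cpp_extension.
import my_cuda_ext

@torch.no_grad()
def rollout(model, d, s, num_steps):
    trajectory = [(d, s)]
    for _ in range(num_steps):
        # 1) host-side call into the PyTorch model (launches its own kernels)
        d = model(torch.cat([d, s], dim=-1))
        # 2) host-side call into the custom CUDA C kernel to build the next s
        s = my_cuda_ext.a_function(d)
        trajectory.append((d, s))
    return trajectory
```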

Ideally, if CUDA C could call the nn model directly, e.g., by passing a model pointer or something similar, then the entire loop above could sit inside one CUDA C kernel. A single for loop in a CUDA C kernel could then work everything out, but I am not sure whether this is possible.

I don’t fully understand the use case.
PyTorch modules are not single CUDA kernels; they are implemented in e.g. Python, call into the C++ backend, and finally dispatch to the corresponding CUDA kernels (see the profiler sketch below).
“Calling the nn model directly” in a CUDA kernel would assume that the “model” is a single kernel.
Could you explain your use case a bit more?
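For example, even a small model launches several kernels in a single forward pass, which you can see in the profiler output (a minimal sketch, not taken from your code):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64)).cuda()
x = torch.randn(32, 64, device="cuda")

# Each linear layer and the activation dispatch to their own CUDA kernel(s),
# so the forward pass is not one kernel you could call from device code.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```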