Concurrent evaluations using same model on a GPU in C++/CUDA

I have been working on a project that performs many parallel evaluations of the same deep neural network model across the threads of a GPU, as part of a control optimization problem in a C++/CUDA environment. In this setup, each thread receives a different input vector and has access to a shared-memory array holding the model parameters. The current implementation is very low-level: it performs the network's matrix–vector products and non-linear activations with custom CUDA functions. Because of these custom functions, we are currently limited to standard multilayer perceptron (MLP) models. We would like to expand our system's capabilities and incorporate other architectures such as CNNs and RNNs.
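For reference, the current approach amounts to something like the following sketch. All names, layer sizes, and the packed weight layout are illustrative assumptions (biases omitted for brevity); the point is that each CUDA thread runs the full MLP forward pass on its own input, reading weights staged in shared memory:

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch of a per-thread MLP evaluation.
// Dimensions and weight layout are assumptions for illustration.
#define IN_DIM   8
#define HID_DIM 16
#define OUT_DIM  4

__device__ float relu(float x) { return x > 0.0f ? x : 0.0f; }

__global__ void mlp_forward(const float* weights,  // all layers, packed row-major
                            const float* inputs,   // [n_evals x IN_DIM]
                            float*       outputs,  // [n_evals x OUT_DIM]
                            int          n_evals)
{
    // Stage the packed parameters in shared memory once per block.
    extern __shared__ float w[];
    const int n_params = HID_DIM * IN_DIM + OUT_DIM * HID_DIM;
    for (int i = threadIdx.x; i < n_params; i += blockDim.x)
        w[i] = weights[i];
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_evals) return;

    const float* x = inputs + tid * IN_DIM;
    float h[HID_DIM];

    // Hidden layer: matrix-vector product followed by ReLU.
    for (int i = 0; i < HID_DIM; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < IN_DIM; ++j)
            acc += w[i * IN_DIM + j] * x[j];
        h[i] = relu(acc);
    }

    // Output layer (linear), reading from the second weight block.
    const float* w2 = w + HID_DIM * IN_DIM;
    for (int i = 0; i < OUT_DIM; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < HID_DIM; ++j)
            acc += w2[i * HID_DIM + j] * h[j];
        outputs[tid * OUT_DIM + i] = acc;
    }
}
// Launched with dynamic shared memory sized to hold the packed weights,
// e.g. mlp_forward<<<blocks, threads, n_params * sizeof(float)>>>(...).
```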

Does PyTorch support something like this in a C++/CUDA environment (TorchScript?), where I can load a model of any architecture onto the GPU and use PyTorch's math library to perform many forward-pass model evaluations in parallel, with each CUDA thread operating on a different input vector? I haven't come across anything online that matches what I'm looking for, but I may simply not be searching with the right terminology.
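To make the question concrete, below is the kind of usage I have in mind with LibTorch (PyTorch's C++ API). The file name `model.pt` and the dimensions are placeholders, and I'm aware this batches all inputs into one tensor and lets LibTorch parallelize internally, rather than giving each CUDA thread its own forward pass as in our custom kernels. Whether that batched pattern is the intended replacement for our per-thread scheme is exactly what I'm asking:

```cpp
#include <torch/script.h>

int main() {
    // Load a TorchScript model previously exported from Python
    // (torch.jit.script / torch.jit.trace + save). "model.pt" is a placeholder.
    torch::jit::script::Module model = torch::jit::load("model.pt");
    model.to(torch::kCUDA);
    model.eval();

    // N independent input vectors, stacked along the batch dimension.
    const int64_t n_evals = 1024, in_dim = 8;
    torch::Tensor inputs = torch::rand({n_evals, in_dim}, torch::kCUDA);

    // One forward call evaluates all inputs; PyTorch handles the
    // GPU parallelism internally.
    torch::NoGradGuard no_grad;
    torch::Tensor outputs = model.forward({inputs}).toTensor();
    // outputs: [n_evals x out_dim], one row per input vector.
    return 0;
}
```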

Hopefully this is clear enough to get across what I'm currently doing and what I'm looking for. Let me know if I need to clarify or elaborate on anything.