Using PyTorch from Cython with nogil for multithreading

Goal: I want to do multithreading with PyTorch and would prefer to use Cython rather than C++ if possible. I would like to take advantage of nogil in Cython, which releases the global interpreter lock (GIL) so that threads can run in parallel rather than just concurrently.

Problem: The Cython compiler does not allow calls to the PyTorch Python frontend while the GIL is released. I think the solution would be to call the PyTorch C++ frontend directly from Cython, because C++ code can be called while the GIL is released.

Question: Is using PyTorch with nogil in Cython feasible and reasonable? If so, could someone point me to an example?

How I got here: I tried using torch.multiprocessing, but it does not work for my use case because I got this error:
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).

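This is because I wanted to evaluate my neural network on the main process and then compute the loss function (which is computationally expensive in my case) on several different batches in parallel, so the network outputs I was sending to the workers were non-leaf tensors that still require grad.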
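
To make the situation concrete, here is a minimal sketch of why the tensors I want to hand off are non-leaf, and why, as far as I understand, the detach() suggested by the error message does not help when gradients are needed. The model, batch shapes, and the squared-mean "loss" are placeholders I made up for illustration, not my real code.

```python
import torch

torch.manual_seed(0)

# Placeholder network and data (assumptions for illustration only).
model = torch.nn.Linear(4, 2)
batches = [torch.randn(8, 4) for _ in range(3)]

# Forward pass on the main process. Each output is computed from parameters
# that require grad, so it is a non-leaf tensor with requires_grad=True.
outputs = [model(b) for b in batches]
print(outputs[0].is_leaf, outputs[0].requires_grad)    # False True

# detach() returns a tensor that is cut off from the autograd graph.
detached = outputs[0].detach()
print(detached.is_leaf, detached.requires_grad)        # True False

# A loss computed from the detached tensor has no path back to the parameters.
loss_detached = detached.pow(2).mean()
print(loss_detached.requires_grad)                     # False

# A loss computed from the original output can still be backpropagated.
loss = outputs[0].pow(2).mean()
loss.backward()
print(model.weight.grad is not None)                   # True
```

So detaching before putting the outputs on a queue would let them be serialized, but the loss computed in the worker could no longer contribute gradients to the network, which is why I am looking at threads instead of processes.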