How to speed up inference of a PyTorch model

I want to run real-time face recognition on a relatively low-performance machine without a GPU.

So far, I use OpenCV to read frames from the camera and detect the position of faces in them.

In the next step, I crop each detected face and feed it into the model.
However, the model currently runs too slowly for real-time use.
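A minimal sketch of the crop-and-classify step, assuming the detector returns (x, y, w, h) boxes as OpenCV's `detectMultiScale` does. `TinyFaceNet`, `crop_faces`, and the 112x112 input size are hypothetical placeholders; the preprocessing must match whatever the real model was trained with (OpenCV frames are BGR). Stacking all faces of a frame into one batch and running under `torch.inference_mode()` can reduce per-face call overhead and skips autograd bookkeeping.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


# Hypothetical stand-in for the face classification model; replace with your own.
class TinyFaceNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(torch.flatten(self.features(x), 1))


def crop_faces(frame: np.ndarray, boxes, size: int = 112) -> torch.Tensor:
    """Crop (x, y, w, h) boxes from an HxWx3 uint8 frame and stack them into one batch."""
    crops = []
    for (x, y, w, h) in boxes:
        face = frame[y:y + h, x:x + w]                       # NumPy slicing, no copy yet
        t = torch.from_numpy(np.ascontiguousarray(face))     # HxWx3 uint8
        t = t.permute(2, 0, 1).unsqueeze(0).float() / 255.0  # 1x3xHxW in [0, 1]
        t = F.interpolate(t, size=(size, size), mode="bilinear", align_corners=False)
        crops.append(t)
    return torch.cat(crops, dim=0)                           # Nx3xSxS


model = TinyFaceNet().eval()                         # eval(): inference behaviour for dropout/batch norm
frame = np.zeros((480, 640, 3), dtype=np.uint8)      # stands in for a camera frame
boxes = [(100, 80, 120, 120), (300, 200, 90, 100)]   # e.g. from detectMultiScale

with torch.inference_mode():                         # no autograd bookkeeping during inference
    logits = model(crop_faces(frame, boxes))
print(logits.shape)  # torch.Size([2, 10])
```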

What should I do?

My idea is to convert the PyTorch model to NumPy and then use Numba to accelerate it.
Is this feasible? If so, how do I convert the model to NumPy and use it?
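Converting is possible in principle, since the weights are just arrays, but every layer has to be re-implemented by hand. Below is a minimal sketch for a hypothetical two-layer MLP (not a face model) that exports the `state_dict` to NumPy and checks that a hand-written forward pass matches PyTorch. NumPy's `@` and PyTorch's CPU kernels both rely on optimized BLAS-style routines, so a NumPy/Numba rewrite is unlikely to be faster than the original model on its own.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical small model; a real face CNN would need every layer reimplemented.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Export the learned parameters as plain NumPy arrays.
params = {k: v.detach().cpu().numpy() for k, v in model.state_dict().items()}


def numpy_forward(x: np.ndarray) -> np.ndarray:
    # nn.Linear computes x @ W.T + b; keys "0.*" and "2.*" follow the Sequential indices.
    h = x @ params["0.weight"].T + params["0.bias"]
    h = np.maximum(h, 0.0)                                   # ReLU
    return h @ params["2.weight"].T + params["2.bias"]


x = np.random.default_rng(0).standard_normal((4, 128)).astype(np.float32)
with torch.inference_mode():
    expected = model(torch.from_numpy(x)).numpy()
print(np.allclose(numpy_forward(x), expected, atol=1e-5))  # True
```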

Thanks for any help.

Does the bottleneck come from face detection or from the classification pipeline?
You might benefit from a network that does both at once.
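One way to answer that is to time each stage separately over many frames. A minimal sketch, assuming the detector and classifier can be wrapped as two callables; `profile_stages`, `fake_detect`, and `fake_classify` are hypothetical names, with the fakes standing in for the real OpenCV detector and PyTorch model.

```python
import time
from collections import defaultdict

import numpy as np


def profile_stages(frames, detect, classify):
    """Return mean wall-clock seconds per frame spent in detection and classification."""
    totals = defaultdict(float)
    for frame in frames:
        t0 = time.perf_counter()
        boxes = detect(frame)
        t1 = time.perf_counter()
        for box in boxes:
            classify(frame, box)
        t2 = time.perf_counter()
        totals["detection"] += t1 - t0
        totals["classification"] += t2 - t1
    n = max(len(frames), 1)
    return {stage: secs / n for stage, secs in totals.items()}


# Hypothetical stand-ins; swap in the real OpenCV detector and the PyTorch model.
def fake_detect(frame):
    return [(0, 0, 10, 10)]


def fake_classify(frame, box):
    x, y, w, h = box
    return float(frame[y:y + h, x:x + w].mean())


frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(20)]
print(profile_stages(frames, fake_detect, fake_classify))
```

Whichever stage dominates the per-frame time is the one worth optimizing first.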

I think the bottleneck is the face classification part.
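If classification is the bottleneck, a few standard PyTorch CPU settings are worth trying before rewriting anything: `model.eval()`, `torch.inference_mode()`, `torch.set_num_threads`, and dynamic int8 quantization via `torch.ao.quantization.quantize_dynamic`. The model below is a hypothetical stand-in and the thread count of 4 is an assumption. Whether quantization helps depends on how much of the real model's time is spent in `nn.Linear` layers, so compare the two latencies on the target machine.

```python
import time

import torch
import torch.nn as nn

# Hypothetical classifier standing in for the real face model.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 112 * 112, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

torch.set_num_threads(4)  # assumption: 4 cores; tune for the target machine

# Dynamic int8 quantization of the nn.Linear layers: weights are stored as int8,
# activations are quantized on the fly. Conv layers are not covered by this mode.
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)


def mean_latency(m, x, runs=20):
    """Average seconds per forward pass, after one warm-up call."""
    with torch.inference_mode():
        m(x)  # warm-up
        t0 = time.perf_counter()
        for _ in range(runs):
            m(x)
        return (time.perf_counter() - t0) / runs


x = torch.rand(1, 3, 112, 112)
print(f"fp32: {mean_latency(model, x) * 1e3:.2f} ms")
print(f"int8: {mean_latency(qmodel, x) * 1e3:.2f} ms")
```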

Thanks for your help.