Forward method uses only one CPU core when written in C++, but multiple cores when written in Python


I have a custom PyTorch model whose forward method I moved from Python to C++ to speed it up. However, I noticed that the pure Python version uses multiple cores during training, while the C++ version uses only one.

Is there a setting that controls this behaviour and that I need to add to my code, or is my C++ implementation simply inefficient?
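In case it helps narrow this down, here is a minimal sketch of how the thread-related settings could be inspected before running either version. It assumes the difference might come from the intra-op thread pool (OpenMP/MKL), which is only a guess on my part. `thread_settings` is a hypothetical helper name and not part of PyTorch; the `torch` calls are skipped if PyTorch is not importable.

```python
import os


def thread_settings():
    """Collect settings that commonly influence CPU parallelism (hypothetical helper)."""
    settings = {
        # Number of logical cores visible to the process.
        "cpu_count": os.cpu_count(),
        # Environment variables read by the OpenMP and MKL runtimes; None if unset.
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "MKL_NUM_THREADS": os.environ.get("MKL_NUM_THREADS"),
    }
    try:
        import torch

        # Threads used inside a single operator (e.g. a matrix multiply).
        settings["torch_intra_op_threads"] = torch.get_num_threads()
        # Threads used to run independent operators concurrently.
        settings["torch_inter_op_threads"] = torch.get_num_interop_threads()
    except ImportError:
        # PyTorch is not installed; report only the environment-level settings.
        pass
    return settings


if __name__ == "__main__":
    for name, value in thread_settings().items():
        print(f"{name}: {value}")
```

If the reported intra-op thread count is greater than one in both setups, my guess would be that the built-in operators in the Python version are parallelised internally, whereas any explicit element-wise loops in the C++ version run on a single thread unless they are parallelised by hand.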

The code for both the Python and C++ versions is shown below:

Model in Python:

Forward/Backward methods in C++: